Project Setup: Installing the Environment
This first code cell automatically checks your Python environment for all required packages (pandas, scikit-learn, torch, xgboost, etc.). If any are missing, it installs them using pip.
Please run this cell and wait for it to complete.
- If it prints "All required packages are already installed," you are ready to proceed.
- If it installs new packages, you may need to restart the kernel before running the rest of the notebook (in Jupyter: Kernel > Restart Kernel...).
import sys
import subprocess
from importlib.metadata import distributions # <-- This replaces pkg_resources
# List of required packages
required_packages = {
"numpy",
"pandas",
"matplotlib",
"seaborn",
"statsmodels",
"scikit-learn",
"xgboost",
"torch",
"fredapi",
"tqdm",
}
# Installation Script
print("Checking environment...")
installed_packages = {dist.metadata["Name"].lower() for dist in distributions()}
missing_packages = required_packages - installed_packages
if not missing_packages:
print("All required packages are already installed. Environment is ready.")
else:
print(f"Installing missing packages: {', '.join(missing_packages)}")
python = sys.executable
# Run pip install for each missing package
for package in missing_packages:
print(f"Installing {package}...")
try:
subprocess.check_call(
[python, "-m", "pip", "install", package],
stdout=subprocess.DEVNULL,
stderr=subprocess.DEVNULL,
)
except subprocess.CalledProcessError as e:
print(f"ERROR: Failed to install {package}. Please install it manually.")
print(e)
print("\nEnvironment setup complete. You may need to restart the kernel.")
Checking environment...
Installing missing packages: seaborn, matplotlib, xgboost, statsmodels, scikit-learn, fredapi
Installing seaborn...
Installing matplotlib...
Installing xgboost...
Installing statsmodels...
Installing scikit-learn...
Installing fredapi...
Environment setup complete. You may need to restart the kernel.
# This cell contains all the main libraries used throughout our analysis.
# Standard Libraries
import copy
import math
import os
from itertools import cycle
# Data Manipulation & Numerics
import numpy as np
import pandas as pd
# Data Visualization
import matplotlib.dates as mdates
import matplotlib.pyplot as plt
import seaborn as sns
# Stats & Econometrics
from fredapi import Fred
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
from statsmodels.stats.contingency_tables import mcnemar
from statsmodels.tsa.stattools import acf
# Machine Learning (Scikit-learn)
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import mutual_info_classif
from sklearn.linear_model import (
LogisticRegressionCV,
MultiTaskElasticNetCV,
MultiTaskLassoCV,
RidgeCV,
)
from sklearn.metrics import (
accuracy_score,
confusion_matrix,
mean_squared_error,
r2_score,
)
from sklearn.model_selection import (
GridSearchCV,
TimeSeriesSplit,
train_test_split,
)
from sklearn.multioutput import MultiOutputClassifier, MultiOutputRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
# Machine Learning (PyTorch & XGBoost)
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset
from xgboost import XGBClassifier
# Utilities
from tqdm.auto import tqdm
import random
# Set the master seed
SEED = 42
def set_all_seeds(seed):
"""Sets the seed for all major sources of randomness."""
# 1. Python's built-in random module
random.seed(seed)
# 2. NumPy
np.random.seed(seed)
# 3. PyTorch (for CPU and GPU)
torch.manual_seed(seed)
if torch.cuda.is_available():
torch.cuda.manual_seed(seed)
torch.cuda.manual_seed_all(seed) # if you are using multi-GPU
# These are required for full reproducibility on cuDNN
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False
# 4. Set Python hash seed (note: this only affects child processes;
# the current interpreter's hash seed is fixed at startup)
os.environ["PYTHONHASHSEED"] = str(seed)
print(f"Global seeds for random, numpy, and torch set to {seed}.")
if torch.cuda.is_available():
print(
"PyTorch CUDA seeds set. Note: cuDNN deterministic mode is on, which may impact performance."
)
set_all_seeds(SEED)
Global seeds for random, numpy, and torch set to 42.
PyTorch CUDA seeds set. Note: cuDNN deterministic mode is on, which may impact performance.
1. Introduction
The U.S. yield curve is a cornerstone of global finance, serving as the primary tool for pricing assets, managing risk, and conducting asset-liability management (ALM). Given its importance, yield curve forecasting is a central challenge in quantitative finance. While traditional econometric models like the Dynamic Nelson-Siegel are foundational, modern machine learning (ML) offers a promising alternative, capable of handling high-dimensional datasets and capturing complex, non-linear patterns that linear models may miss.
This notebook documents a comprehensive, empirical study comparing the efficacy of classical and modern ML techniques for this task. Our research followed a two-stage process. We first tested a series of regularized linear regressions (Ridge, Lasso, ElasticNet) on high-frequency daily data. This approach, however, failed to produce any meaningful out-of-sample (OOS) predictive power, with models showing severe overfitting.
Based on these findings, we pivoted our methodology to test a more tractable hypothesis. We hypothesized that a simpler problem definition would yield a clearer signal. We therefore:
- Aggregated the data to a weekly frequency to filter daily noise.
- Simplified the task from regression to binary classification, predicting only the direction of weekly yield changes.
On this new weekly classification framework, we trained and evaluated a comprehensive suite of models:
- Logistic Regression (with L2-CV)
- Random Forest
- XGBoost
- Long Short-Term Memory (LSTM) Network
The models are trained using a rigorous rolling-window framework (15-year train, 4-week forecast) and evaluated using statistical significance tests (McNemar's) to validate their performance against a naive benchmark. This project's goal is to identify the most robust modeling technique, feature set, and methodology for this complex forecasting challenge.
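The rolling-window scheme described above can be sketched as follows. This is a minimal illustration of the splitting logic only, not the project's implementation; the 15-year (≈780-week) training length and 4-week forecast horizon are taken from the text, while the index and helper name are illustrative.

```python
import pandas as pd

def rolling_windows(index, train_weeks=15 * 52, test_weeks=4):
    """Yield (train_idx, test_idx) pairs: a fixed-length training window
    followed by a short out-of-sample block, rolled forward through time."""
    start = 0
    while start + train_weeks + test_weeks <= len(index):
        train_idx = index[start : start + train_weeks]
        test_idx = index[start + train_weeks : start + train_weeks + test_weeks]
        yield train_idx, test_idx
        start += test_weeks  # roll forward by the forecast horizon

# Example on a synthetic weekly (Friday-anchored) index
idx = pd.date_range("2003-04-04", periods=16 * 52, freq="W-FRI")
splits = list(rolling_windows(idx))
print(len(splits), "windows;", "first OOS block starts", splits[0][1][0].date())
```

Because each test block starts exactly where the previous one ended, every out-of-sample week is forecast exactly once and the model never sees future data during training.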
2. Data Collection & Preprocessing
2.1. FRED
To capture the macroeconomic and financial drivers of the yield curve, we source exogenous predictors from the Federal Reserve Economic Data (FRED) repository, maintained by the Research Division of the Federal Reserve Bank of St. Louis. FRED is widely recognized as the premier open-access database for U.S. economic time series.
api_key = os.environ.get("FRED_API_KEY")  # set the FRED_API_KEY environment variable; avoid hard-coding API keys
fred_api = Fred(api_key=api_key)
This section defines the complete set of predictors (features) for our yield curve forecasting models. The features are selected based on established economic theory and are identified by their FRED ticker. There are 94 of them in total.
The selection provides a comprehensive view of the macro-financial environment by covering all major drivers of interest rates:
Production & Income: Leading indicators for economic health.
Labor Market: Employment/unemployment data, a key dual-mandate component.
Housing & Consumption: Interest-rate sensitive sectors.
Money & Credit: Monetary aggregates and bank lending activity.
Price Indexes: Key inflation measures (CPI, PCE) monitored by the Fed.
Fiscal Variables: Government spending, debt, and deficit data.
Domestic Yields: The full US Treasury curve, serving as autoregressive features.
International Yields: "Safe haven" yields from other major economies.
Stock Market & FX: Broader market sentiment and capital flows.
Risk & Volatility: Measures of market stress and oil prices.
Monetary Policy (Traditional): The target Fed Funds Rate.
Monetary Policy (Unconventional): Fed balance sheet data (QE/QT).
Inflation Expectations: Market-based forward inflation expectations.
To ensure maximum information, all variables were extracted from the FRED database at their highest available frequency. This resulted in a mixed-frequency dataset:
Daily: Most financial markets data
Monthly: Most macroeconomic variables
Quarterly: Most fiscal variables
Limitations and Missing Data
It's important to note that some key features, such as Gold and the S&P 500, were not available in this initial FRED extraction. These will need to be sourced from an alternative database to complete the feature set.
# --- Master Feature Set & Sample Period Definition ---
fred_ticker = [
# 1. Production and Income
"RPI", # Real Personal Income
"INDPRO", # Industrial Production Index
# 2. Labor Market
"UNRATE", # Unemployment Rate
"UEMP5TO14", # Short-Term Unemployment (5-14 wks)
"UEMP27OV", # Long-Term Unemployment (27+ wks)
"PAYEMS", # Total Non-Farm Payroll
"USGOOD", # Employment: Goods-Producing
"CES1021000001", # Employment: Mining
"USCONS", # Employment: Construction
"MANEMP", # Employment: Manufacturing
"DMANEMP", # Employment: Durable Goods
"NDMANEMP", # Employment: Non-Durable Goods
"SRVPRD", # Employment: Private Service-Providing
"USTPU", # Employment: Trade, Transport, Utilities
"USWTRADE", # Employment: Wholesale Trade
"USTRADE", # Employment: Retail Trade
"USFIRE", # Employment: Financial Activities
"USGOVT", # Employment: Government
# 3. Consumption and Housing
"HOUST", # Housing Starts
"PERMIT", # Building Permits
"UMCSENT", # U. of Michigan Consumer Sentiment
# 4. Money and Credit
"M1SL", # M1 Money Stock
"M2SL", # M2 Money Stock
"M2REAL", # Real M2 Money Stock
"TOTRESNS", # Total Bank Reserves
"NONBORRES", # Non-Borrowed Bank Reserves
"BUSLOANS", # Commercial & Industrial Loans
"REALLN", # Real Estate Loans (All)
"NONREVSL", # Non-Revolving Consumer Credit
"DTCOLNVHFNM", # Consumer Motor Vehicle Loans Outstanding
"DTCTHFNM", # Total Consumer Loans and Leases Outstanding
"INVEST", # Gross Private Domestic Investment
# 5. Price Indexes (Inflation)
"CPIAUCSL", # CPI (All Urban Consumers)
"CPIAPPSL", # CPI: Apparel
"CPITRNSL", # CPI: Transportation
"CPIMEDSL", # CPI: Medical Care
"CUSR0000SAC", # CPI: Commodities
"CUUR0000SAD", # CPI: Durables
"CUSR0000SAS", # CPI: Services
"CPIULFSL", # CPI: All Items Less Food
"CUUR0000SA0L2", # CPI: All Items Less Shelter
"CUSR0000SA0L5", # CPI: All Items Less Food & Energy
"PCEPI", # PCE: Price Index (Fed's Target)
"DDURRG3M086SBEA", # PCE: Durable Goods
"DNDGRG3M086SBEA", # PCE: Non-Durable Goods
"DSERRG3M086SBEA", # PCE: Services
# 6. Fiscal Variables
"FGEXPND", # Federal Government Expenditures (Q)
"FGRECPT", # Federal Government Receipts (Q)
"FGDEF", # Federal Government Deficit (Q)
"FGCE", # Federal Government Spending (Q)
"MTSDS133FMS", # Federal Surplus/Deficit (M)
"FYGFDPUN", # Federal Debt Held by Public (Q)
"GFDEGDQ188S", # Debt-to-GDP Ratio (Q)
"A091RC1Q027SBEA", # Net Interest Payments (Q)
"GFDEBTN", # Total Public Debt Outstanding (M)
# 7. Domestic Yield Variables
"DGS1MO",
"DGS3MO",
"DGS6MO", # T-Bills (< 1Y)
"DGS1",
"DGS2",
"DGS3",
"DGS5",
"DGS7",
"DGS10",
"DGS20",
"DGS30", # T-Notes/Bonds (>= 1Y)
# 8. International Yields
"IRLTLT01DEM156N", # 10Y German Bund
"IRLTLT01JPM156N", # 10Y Japanese Govt Bond
"IRLTLT01GBM156N", # 10Y UK Gilt
"IRLTLT01CAM156N", # 10Y Canadian Govt Bond
"IRLTLT01AUM156N", # 10Y Australian Govt Bond
"IRLTLT01FRM156N", # 10Y French Govt Bond
# 9. Stock Markets, FX, and Corporate Bonds
"NASDAQCOM", # NASDAQ Composite Index
"AAA", # Moody's Seasoned AAA Corporate Bond Yield
"BAA", # Moody's Seasoned BAA Corporate Bond Yield
"DEXUSEU", # USD / EUR
"DEXJPUS", # JPY / USD
"DEXUSUK", # USD / GBP
"DEXCAUS", # CAD / USD
"DEXUSAL", # USD / AUD
"DEXCHUS", # CNY / USD
# 10. Risk & Commodity Variables
"VIXCLS", # CBOE Volatility Index (VIX)
"NFCI", # National Financial Conditions Index
"DCOILWTICO", # WTI Crude Oil Price
# 11. Traditional Monetary Policy
"FEDFUNDS", # Effective Federal Funds Rate
# 12. Unconventional Monetary Policy (QE/QT)
"WRESBAL", # Bank Reserves at Fed
"CURRCIR", # Currency in Circulation
"WTREGEN", # US Treasury General Account
"BOGMBASE", # Monetary Base
"WSHOSHO", # Fed Holdings: Securities Held Outright
"WSHOTSL", # Fed Holdings: U.S. Treasury Securities
"WSHOBL", # Fed Holdings: T-Bonds & T-Notes (> 1Y)
# 13. Market Inflation Expectations
"T5YIE", # 5-Year Breakeven Inflation Rate
"T10YIE", # 10-Year Breakeven Inflation Rate
]
Sample Period and Data Frequency: The start_date is set to September 2002, as this date provides the best compromise for data availability across our large feature set; many series have missing or incomplete data before this point.
- After the stationarity transformations and trimming below, the final dataset spans 2003-04-01 to 2025-08-01.
start_date = "2002-09-01"
end_date = "2025-08-01"
fred = {}
for feature in fred_ticker:
print(feature)
try:
series = fred_api.get_series(
feature, observation_start=start_date, observation_end=end_date
)
fred[feature] = series
except ValueError:
print(f"Could not retrieve data for {feature}.")
continue
Helper Function: Stationarity & Transformation Check
This utility visualizes four key transformations to determine the optimal input for modeling.
- Original: Baseline check for trends, seasonality, and outliers.
- Diff ($\Delta X_t$): Removes linear trends to test for mean reversion (stationarity).
- Log ($\ln X_t$): Stabilizes growing variance (heteroscedasticity) often seen in financial data.
- Log-Diff ($\Delta \ln X_t$): Approximates percentage returns. Combines trend removal and variance stabilization; usually the target state for modeling.
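The claim that the log-difference approximates a percentage return can be checked with a quick numeric example (synthetic prices, not project data): for small moves, $\ln(1+r) \approx r$.

```python
import numpy as np
import pandas as pd

prices = pd.Series([100.0, 101.0, 99.5, 102.0])
log_diff = np.log(prices).diff()   # Delta ln X_t
pct_change = prices.pct_change()   # simple returns
# For small moves the two series are nearly identical
print(pd.DataFrame({"log_diff": log_diff, "pct_change": pct_change}).round(5))
```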
def plot_transformations(ts_dict):
"""
Plots Original, Diff, Log, and Log-Diff transformations.
Args:
ts_dict: Dictionary of time series (pandas Series).
"""
for key, ts in ts_dict.items():
# 1. Data Prep
# Drop NaNs explicitly to avoid plotting gaps
ts_clean = ts.dropna()
# 2. Transformations
ts_diff1 = ts_clean.diff()
# SAFETY: Log of negative numbers or zero is undefined (-inf/nan).
# We use .where() to filter, so we don't crash the script.
ts_log = np.log(ts_clean.where(ts_clean > 0))
ts_log_diff = ts_log.diff()
# 3. Plotting
fig, axes = plt.subplots(1, 4, figsize=(24, 5), constrained_layout=True)
fig.suptitle(f"Stationarity Check: {key}", fontsize=16, weight="bold")
# Helper to style each subplot consistently
def style_plot(ax, data, title, color):
ax.plot(data, color=color, linewidth=1.5)
ax.set_title(title, fontsize=12, weight="bold")
ax.grid(True, linestyle="--", alpha=0.5) # Grids help read values
ax.spines["top"].set_visible(False)
ax.spines["right"].set_visible(False)
style_plot(axes[0], ts_clean, "Original Level", "#1f77b4") # Blue
style_plot(
axes[1], ts_diff1, r"First Difference ($\Delta$)", "#ff7f0e"
) # Orange
style_plot(axes[2], ts_log, r"Log Level ($\ln$)", "#2ca02c") # Green
style_plot(axes[3], ts_log_diff, r"Log Returns ($\Delta \ln$)", "#d62728") # Red
plt.show()
plt.close(fig)
plot_transformations(fred)
After visually inspecting the time series plots for all features, we identified non-stationary behaviors (such as trends and varying variance) in many variables. To prepare the dataset for modeling, we applied specific transformations to induce stationarity:
- Log-Difference (logdiff): Applied to variables showing exponential growth or trends (e.g., production, employment, money supply).
- Difference (diff): Applied to interest rates, yields, and ratios that are already in percentage terms.
- Remove: Variables identified as redundant, excessively noisy, or lacking clear predictive signal were excluded from the feature set.
# Variables to Log-Difference (Growth Rates)
vars_logdiff = [
"RPI",
"INDPRO",
"UNRATE",
"UEMP5TO14",
"PAYEMS",
"USGOOD",
"USCONS",
"MANEMP",
"DMANEMP",
"NDMANEMP",
"SRVPRD",
"USTPU",
"USWTRADE",
"USTRADE",
"USFIRE",
"USGOVT",
"HOUST",
"PERMIT",
"UMCSENT",
"M1SL",
"M2SL",
"M2REAL",
"TOTRESNS",
"NONREVSL",
"DTCOLNVHFNM",
"DTCTHFNM",
"CPIAUCSL",
"CPIAPPSL",
"CPITRNSL",
"CUSR0000SAC",
"CUUR0000SAD",
"CPIULFSL",
"CUUR0000SA0L2",
"PCEPI",
"DNDGRG3M086SBEA",
"FGEXPND",
"GFDEGDQ188S",
"GFDEBTN",
"NASDAQCOM",
"DEXUSEU",
"DEXJPUS",
"DEXUSUK",
"DEXCAUS",
"DEXUSAL",
"VIXCLS",
"DCOILWTICO",
"FEDFUNDS",
"CURRCIR",
"BOGMBASE",
"WSHOSHO",
"WSHOTSL",
]
# Variables to Difference (Changes in Levels/Rates)
vars_diff = [
"FGDEF",
"MTSDS133FMS",
"FYGFDPUN",
"DGS1MO",
"DGS3MO",
"DGS6MO",
"DGS1",
"DGS2",
"DGS3",
"DGS5",
"DGS7",
"DGS10",
"DGS20",
"DGS30",
"IRLTLT01DEM156N",
"IRLTLT01JPM156N",
"IRLTLT01GBM156N",
"IRLTLT01CAM156N",
"IRLTLT01AUM156N",
"IRLTLT01FRM156N",
"AAA",
"BAA",
"NFCI",
"T5YIE",
"T10YIE",
]
# Variables to Remove
vars_remove = [
"UEMP27OV",
"CES1021000001",
"NONBORRES",
"BUSLOANS",
"REALLN",
"INVEST",
"CPIMEDSL",
"CUSR0000SAS",
"CUSR0000SA0L5",
"DDURRG3M086SBEA",
"DSERRG3M086SBEA",
"FGRECPT",
"FGCE",
"A091RC1Q027SBEA",
"DEXCHUS",
"WRESBAL",
"WTREGEN",
"WSHOBL",
]
dic_modif = {}
for var in vars_logdiff:
dic_modif[var] = ["logdiff"]
for var in vars_diff:
dic_modif[var] = ["diff"]
for var in vars_remove:
dic_modif[var] = ["remove"]
for key, value in dic_modif.items():
if key in fred.keys():
if value == ["remove"]:
del fred[key]
elif value == ["logdiff"]:
# Apply Log-Difference (Growth Rate)
# Note: np.log will result in -inf/NaN if data <= 0
fred[key] = np.log(fred[key]).diff()
elif value == ["diff"]:
# Apply Simple Difference (Change in Level)
fred[key] = fred[key].diff()
/opt/python/lib/python3.13/site-packages/pandas/core/arraylike.py:399: RuntimeWarning: invalid value encountered in log result = getattr(ufunc, method)(*inputs, **kwargs)
To align features with varying reporting frequencies (weekly, monthly, quarterly) to our modeling timeline, we employ a Forward Fill (Last Observation Carried Forward) strategy. This ensures that at any time t, the model utilizes the most recently disseminated data, preserving the chronological flow of information.
Additionally, we apply a Backward Fill to impute missing values at the very beginning of the sample. This step does not introduce look-ahead bias: the underlying time series contain historical data predating our sample start, so these values were historically available at that time.
df_fred = pd.concat(fred, axis=1)
df_fred.columns = fred.keys()
df_fred = df_fred.ffill()
df_fred = df_fred.bfill()
df_fred = df_fred[df_fred.index >= "2003-04-01"]
df_fred.head()
| RPI | INDPRO | UNRATE | UEMP5TO14 | PAYEMS | USGOOD | USCONS | MANEMP | DMANEMP | NDMANEMP | ... | VIXCLS | NFCI | DCOILWTICO | FEDFUNDS | CURRCIR | BOGMBASE | WSHOSHO | WSHOTSL | T5YIE | T10YIE | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2003-04-01 | 0.004111 | -0.006278 | 0.016807 | 0.035964 | -0.000484 | -0.003744 | 0.005246 | -0.007706 | -0.008612 | -0.006242 | ... | -0.027475 | -0.01957 | -0.054781 | 0.007968 | 0.005315 | 0.006078 | 0.001565 | 0.001565 | -0.01 | 0.02 |
| 2003-04-02 | 0.004111 | -0.006278 | 0.016807 | 0.035964 | -0.000484 | -0.003744 | 0.005246 | -0.007706 | -0.008612 | -0.006242 | ... | -0.012061 | -0.01957 | -0.032055 | 0.007968 | 0.005315 | 0.006078 | 0.000906 | 0.000906 | 0.02 | 0.01 |
| 2003-04-03 | 0.004111 | -0.006278 | 0.016807 | 0.035964 | -0.000484 | -0.003744 | 0.005246 | -0.007706 | -0.008612 | -0.006242 | ... | 0.006758 | -0.01957 | 0.017362 | 0.007968 | 0.005315 | 0.006078 | 0.000906 | 0.000906 | -0.02 | -0.02 |
| 2003-04-04 | 0.004111 | -0.006278 | 0.016807 | 0.035964 | -0.000484 | -0.003744 | 0.005246 | -0.007706 | -0.008612 | -0.006242 | ... | 0.032092 | -0.02332 | -0.022277 | 0.007968 | 0.005315 | 0.006078 | 0.000906 | 0.000906 | -0.01 | -0.01 |
| 2003-04-07 | 0.004111 | -0.006278 | 0.016807 | 0.035964 | -0.000484 | -0.003744 | 0.005246 | -0.007706 | -0.008612 | -0.006242 | ... | -0.023620 | -0.02332 | -0.023145 | 0.007968 | 0.005315 | 0.006078 | 0.000906 | 0.000906 | -0.02 | 0.02 |
5 rows Γ 76 columns
2.2. Complementary Data: Gold and S&P 500
2.2.1. Gold
Daily XAU/USD (gold) spot prices are retrieved from https://www.investing.com/currencies/xau-usd-historical-data.
data_path = os.path.join("data", "US", "xau_usd.csv")
data_path2 = os.path.join("data", "US", "xau_usd2.csv")
df_gold = pd.concat(
[
pd.read_csv(data_path2, index_col=0, thousands=","),
pd.read_csv(data_path, index_col=0, thousands=","),
]
)
In this step, we transform the raw Gold spot price into Log-Returns to ensure stationarity for our models.
# Reverse to chronological order
df_gold = df_gold.iloc[::-1]
df_gold.index = pd.to_datetime(df_gold.index)
# Calculate Log Returns
df_gold["XAU_LogReturns"] = np.log(df_gold["Price"]).diff()
# Plotting
fig, ax = plt.subplots(figsize=(12, 4))
ax.plot(
df_gold.index,
df_gold["XAU_LogReturns"],
color="goldenrod",
linewidth=0.8,
alpha=0.9,
label="Daily Log Returns",
)
ax.set_title(
"Gold Price: Daily Log Returns (Volatility Structure)", fontsize=14, weight="bold"
)
ax.set_ylabel(r"Log Return ($\Delta \ln P$)", fontsize=12)
ax.set_xlabel("Date", fontsize=12)
# Grid and clean borders
ax.grid(True, which="major", linestyle="--", linewidth=0.5, alpha=0.7)
ax.spines["top"].set_visible(False)
ax.spines["right"].set_visible(False)
plt.tight_layout()
plt.show()
df_gold = df_gold[["XAU_LogReturns"]]
df_gold.head()
| XAU_LogReturns | |
|---|---|
| Date | |
| 2000-01-03 | NaN |
| 2000-01-04 | -0.022925 |
| 2000-01-05 | -0.005147 |
| 2000-01-06 | 0.000818 |
| 2000-01-07 | 0.001847 |
2.2.2. S&P 500
We obtain daily S&P 500 index-level data from WRDS: https://wrds-www.wharton.upenn.edu
data_path = os.path.join("data", "US", "sp500.csv")
df_sp500 = pd.read_csv(data_path, index_col=0)
df_sp500.head()
| vwretd | vwretx | ewretd | ewretx | sprtrn | spindx | |
|---|---|---|---|---|---|---|
| DlyCalDt | ||||||
| 2000-01-03 | -0.006803 | -0.006810 | 0.002878 | 0.002860 | -0.009549 | 1455.22 |
| 2000-01-04 | -0.039652 | -0.039679 | -0.017465 | -0.017486 | -0.038345 | 1399.42 |
| 2000-01-05 | -0.000935 | -0.001009 | 0.007821 | 0.007743 | 0.001922 | 1402.11 |
| 2000-01-06 | -0.007391 | -0.007547 | 0.004504 | 0.004453 | 0.000956 | 1403.45 |
| 2000-01-07 | 0.032516 | 0.032514 | 0.017008 | 0.016991 | 0.027090 | 1441.47 |
Similar to the commodity data, we transform the S&P 500 index into Log-Returns to induce stationarity.
# Calculate Log Returns
df_sp500["SP500_LogReturns"] = np.log(df_sp500["spindx"]).diff()
df_sp500.index = pd.to_datetime(df_sp500.index)
# Plotting
fig, ax = plt.subplots(figsize=(12, 5))
ax.plot(
df_sp500.index,
df_sp500["SP500_LogReturns"],
color="#003366", # Dark Navy Blue
linewidth=0.8,
alpha=0.9,
label="Daily Log Returns",
)
# Styling
ax.set_title(
"S&P 500: Daily Log Returns (Volatility Structure)", fontsize=14, weight="bold"
)
ax.set_ylabel(r"Log Return ($\Delta \ln P$)", fontsize=12)
ax.set_xlabel("Date", fontsize=12)
ax.grid(True, axis="y", which="major", linestyle="--", linewidth=0.5, alpha=0.7)
ax.spines["top"].set_visible(False)
ax.spines["right"].set_visible(False)
ax.xaxis.set_major_locator(mdates.AutoDateLocator(minticks=5, maxticks=10))
ax.xaxis.set_major_formatter(mdates.DateFormatter("%Y"))
plt.setp(ax.get_xticklabels(), rotation=30, ha="right")
plt.tight_layout()
plt.show()
df_sp500 = df_sp500[["SP500_LogReturns"]]
2.3. Merge Datasets
dus = df_fred.merge(
right=df_gold, how="inner", left_index=True, right_index=True
).merge(right=df_sp500, how="inner", left_index=True, right_index=True)
dus.index = pd.to_datetime(dus.index)
dus.head()
| RPI | INDPRO | UNRATE | UEMP5TO14 | PAYEMS | USGOOD | USCONS | MANEMP | DMANEMP | NDMANEMP | ... | DCOILWTICO | FEDFUNDS | CURRCIR | BOGMBASE | WSHOSHO | WSHOTSL | T5YIE | T10YIE | XAU_LogReturns | SP500_LogReturns | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2003-04-01 | 0.004111 | -0.006278 | 0.016807 | 0.035964 | -0.000484 | -0.003744 | 0.005246 | -0.007706 | -0.008612 | -0.006242 | ... | -0.054781 | 0.007968 | 0.005315 | 0.006078 | 0.001565 | 0.001565 | -0.01 | 0.02 | -0.007018 | 0.012071 |
| 2003-04-02 | 0.004111 | -0.006278 | 0.016807 | 0.035964 | -0.000484 | -0.003744 | 0.005246 | -0.007706 | -0.008612 | -0.006242 | ... | -0.032055 | 0.007968 | 0.005315 | 0.006078 | 0.000906 | 0.000906 | 0.02 | 0.01 | -0.015396 | 0.025781 |
| 2003-04-03 | 0.004111 | -0.006278 | 0.016807 | 0.035964 | -0.000484 | -0.003744 | 0.005246 | -0.007706 | -0.008612 | -0.006242 | ... | 0.017362 | 0.007968 | 0.005315 | 0.006078 | 0.000906 | 0.000906 | -0.02 | -0.02 | -0.014099 | -0.005064 |
| 2003-04-04 | 0.004111 | -0.006278 | 0.016807 | 0.035964 | -0.000484 | -0.003744 | 0.005246 | -0.007706 | -0.008612 | -0.006242 | ... | -0.022277 | 0.007968 | 0.005315 | 0.006078 | 0.000906 | 0.000906 | -0.01 | -0.01 | 0.000983 | 0.002735 |
| 2003-04-07 | 0.004111 | -0.006278 | 0.016807 | 0.035964 | -0.000484 | -0.003744 | 0.005246 | -0.007706 | -0.008612 | -0.006242 | ... | -0.023145 | 0.007968 | 0.005315 | 0.006078 | 0.000906 | 0.000906 | -0.02 | 0.02 | -0.008324 | 0.001228 |
5 rows Γ 78 columns
dp_daily = os.path.join("data", "US", "us_data.parquet")
dus.to_parquet(dp_daily)
Let's also create a weekly-frequency dataset.
dus_weekly = dus.resample("W-FRI").last()
dus_weekly.head()
| RPI | INDPRO | UNRATE | UEMP5TO14 | PAYEMS | USGOOD | USCONS | MANEMP | DMANEMP | NDMANEMP | ... | DCOILWTICO | FEDFUNDS | CURRCIR | BOGMBASE | WSHOSHO | WSHOTSL | T5YIE | T10YIE | XAU_LogReturns | SP500_LogReturns | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2003-04-04 | 0.004111 | -0.006278 | 0.016807 | 0.035964 | -0.000484 | -0.003744 | 0.005246 | -0.007706 | -0.008612 | -0.006242 | ... | -0.022277 | 0.007968 | 0.005315 | 0.006078 | 0.000906 | 0.000906 | -0.01 | -0.01 | 0.000983 | 0.002735 |
| 2003-04-11 | 0.004111 | -0.006278 | 0.016807 | 0.035964 | -0.000484 | -0.003744 | 0.005246 | -0.007706 | -0.008612 | -0.006242 | ... | 0.038938 | 0.007968 | 0.005315 | 0.006078 | 0.005599 | 0.005600 | -0.02 | -0.02 | 0.005132 | -0.003770 |
| 2003-04-18 | 0.004111 | -0.006278 | 0.016807 | 0.035964 | -0.000484 | -0.003744 | 0.005246 | -0.007706 | -0.008612 | -0.006242 | ... | 0.031727 | 0.007968 | 0.005315 | 0.006078 | 0.002649 | 0.002649 | 0.00 | 0.00 | 0.002295 | 0.015416 |
| 2003-04-25 | 0.004111 | -0.006278 | 0.016807 | 0.035964 | -0.000484 | -0.003744 | 0.005246 | -0.007706 | -0.008612 | -0.006242 | ... | -0.059898 | 0.007968 | 0.005315 | 0.006078 | 0.000164 | 0.000164 | 0.00 | 0.02 | -0.003593 | -0.013943 |
| 2003-05-02 | 0.007335 | -0.000344 | 0.016529 | -0.011338 | 0.000184 | -0.001327 | 0.002538 | -0.003429 | -0.003777 | -0.002866 | ... | -0.011972 | 0.000000 | 0.005032 | 0.005761 | 0.000587 | 0.000587 | 0.04 | 0.00 | -0.002050 | 0.014927 |
5 rows Γ 78 columns
dp_weekly = os.path.join("data", "US", "us_data_weekly.parquet")
dus_weekly.to_parquet(dp_weekly)
3. Exploratory Data Analysis
3.1. Importing daily and weekly data
dus_daily = pd.read_parquet(dp_daily)
dus_weekly = pd.read_parquet(dp_weekly)
columns = dus_daily.columns
n_features = len(columns)
ncols = 6
nrows = int(np.ceil(n_features / ncols))
# Create Figure & Subplots
fig, axes = plt.subplots(
nrows=nrows, ncols=ncols, figsize=(ncols * 5, nrows * 4), constrained_layout=True
)
fig.suptitle(
"Distribution Analysis for All Features",
fontsize=24,
weight="bold",
)
# We use axes.flatten() to make iterating easy
for i, ax in enumerate(axes.flatten()):
if i < n_features:
col_name = columns[i]
data = dus_daily[col_name].dropna()
ax.hist(
data,
bins="auto",
color="#1f77b4",
edgecolor="white",
linewidth=0.5,
)
# Styling
ax.set_title(col_name, fontsize=12, weight="bold")
ax.set_xlabel("Value", fontsize=10)
ax.set_ylabel("Frequency", fontsize=10)
ax.grid(True, axis="y", linestyle="--", alpha=0.6)
# Remove the top and right "box lines"
ax.spines["top"].set_visible(False)
ax.spines["right"].set_visible(False)
else:
ax.set_visible(False)
plt.show()
3.2. Exploration and Feature Engineering
3.2.1. Daily Data
Let's start by analyzing our targets' autocorrelations and partial autocorrelations.
yields = [
"DGS1MO",
"DGS3MO",
"DGS6MO",
"DGS1",
"DGS2",
"DGS3",
"DGS5",
"DGS7",
"DGS10",
"DGS20",
"DGS30",
]
for y in yields:
series = dus_daily[y]
fig, axes = plt.subplots(1, 2, figsize=(8, 4))
plot_acf(series, lags=90, title=f"ACF {y}", ax=axes[0])
plot_pacf(series, lags=90, title=f"PACF {y}", ax=axes[1])
for ax in axes:
ax.set_xlim(0.5, 90)
ax.set_ylim(-0.2, 0.2)
plt.show()
Short-Term Yields: These show strong, persistent serial correlation. The autocorrelation function (ACF) decays slowly, meaning changes from numerous past lags (up to 30 days) are significantly correlated with near-future changes.
Long-Term Yields: These show a much faster ACF decay. Only the most immediate past changes (t-1, t-2) show a slight correlation with the next day's change. Surprisingly, we also see persistent, anomalous correlations at more distant, specific lags (t-25 and t-50).
Let's add some new features : lagged yields. To capture the different "memories" of the yield curve, we engineered a maturity-dependent lag structure for our features:
For Short-Term Maturities (<= 1 Year): A dense set of 10 lags was used to model their complex, high-frequency dynamics: t-1, t-2, t-5, t-10, t-15, t-20, t-25, t-30, t-40, and t-50.
For Long-Term Maturities (> 1 Year): A sparser, more parsimonious set of 6 lags was selected: t-1, t-2, t-10, t-15, t-25, and t-50.
# Dense lag set for short-term maturities (the 1Y point included)
for ticker in ["DGS1MO", "DGS3MO", "DGS6MO", "DGS1"]:
    for lag in [1, 2, 5, 10, 15, 20, 25, 30, 40, 50]:
        dus_daily[f"{ticker}_t-{lag}"] = dus_daily[ticker].shift(lag)
# Sparser lag set for longer maturities
for ticker in ["DGS2", "DGS3", "DGS5", "DGS7", "DGS10", "DGS20", "DGS30"]:
    for lag in [1, 2, 10, 15, 25, 50]:
        dus_daily[f"{ticker}_t-{lag}"] = dus_daily[ticker].shift(lag)
Let's set up the targets to do some feature selection. Note that former yield columns can then be dropped.
# Next-day yield changes as targets
for y in yields:
    dus_daily[f"Y_{y}"] = dus_daily[y].shift(-1)
# The contemporaneous yield columns are no longer needed
dus_daily = dus_daily.drop(columns=yields)
Let's explore the correlation heatmap between our time series.
plt.figure(figsize=(25, 25))
sns.heatmap(dus_daily.corr(), cmap="seismic", center=0)
Our preliminary analysis revealed that most features have a low individual correlation with the next-day yield changes. To build a more parsimonious model and reduce noise, we implemented a two-tiered feature selection process based on absolute correlation coefficients:
Lagged Features: We set a higher threshold of 0.05. Any lagged feature whose absolute correlation fell below this value for all target variables was removed.
Other Features: We used a more lenient threshold of 0.03. This distinction was critical, as a uniform 0.05 filter proved overly aggressive, removing theoretically important predictors such as the S&P 500, gold, and the VIX.
Furthermore, we observed significant multicollinearity among the remaining lagged features. To address this, we will also test model variants that incorporate Principal Component Analysis (PCA) within the pipeline. This step transforms the correlated lagged features into a smaller set of orthogonal components, which should improve model stability and reduce overfitting.
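The PCA-within-pipeline variant can be sketched as below. This is an illustrative assumption, not the final model configuration: the classifier, component count, and synthetic data are placeholders. The key point is that fitting PCA inside a scikit-learn Pipeline means it is re-estimated on each training window only, avoiding look-ahead leakage into the test period.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real feature matrix: 20 collinear columns
# driven by 5 latent factors, mimicking correlated lagged yields.
rng = np.random.default_rng(42)
factors = rng.normal(size=(500, 5))
X = factors @ rng.normal(size=(5, 20)) + 0.1 * rng.normal(size=(500, 20))
y = (factors[:, 0] + 0.5 * rng.normal(size=500) > 0).astype(int)

pipe = Pipeline([
    ("scaler", StandardScaler()),   # PCA is scale-sensitive
    ("pca", PCA(n_components=5)),   # collinear features -> orthogonal components
    ("clf", LogisticRegression()),
])
pipe.fit(X[:400], y[:400])          # transformers are fit on the training slice only
print(f"OOS accuracy: {pipe.score(X[400:], y[400:]):.3f}")
```

Calling `pipe.fit` on the training slice fits the scaler and PCA on that slice alone; `pipe.score` then applies the already-fitted transforms to the held-out data, which is exactly the behavior needed inside a rolling-window evaluation.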
Y = dus_daily[["Y_" + y for y in yields]]
Y.head()
| Y_DGS1MO | Y_DGS3MO | Y_DGS6MO | Y_DGS1 | Y_DGS2 | Y_DGS3 | Y_DGS5 | Y_DGS7 | Y_DGS10 | Y_DGS20 | Y_DGS30 | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 2003-04-01 | 0.00 | 0.01 | 0.02 | 0.06 | 0.10 | 0.11 | 0.11 | 0.13 | 0.10 | 0.09 | 0.09 |
| 2003-04-02 | 0.00 | -0.02 | -0.02 | -0.04 | -0.05 | -0.04 | -0.02 | -0.03 | -0.01 | 0.01 | 0.01 |
| 2003-04-03 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.01 | 0.01 | 0.03 | 0.03 | 0.02 | 0.04 |
| 2003-04-04 | 0.02 | 0.06 | 0.07 | 0.07 | 0.10 | 0.10 | 0.11 | 0.07 | 0.07 | 0.04 | 0.04 |
| 2003-04-07 | 0.00 | -0.02 | -0.02 | -0.04 | -0.06 | -0.07 | -0.08 | -0.08 | -0.08 | -0.08 | -0.08 |
dus_daily_lagged_features = dus_daily[
[col for col in dus_daily.columns if "_t-" in col]
]
dus_daily_other = dus_daily.drop(
columns=[col for col in dus_daily.columns if "_t-" in col]
)
corrs = pd.DataFrame(
{target: dus_daily_lagged_features.corrwith(Y[target]) for target in Y.columns}
).abs()
mask = (corrs < 0.05).all(axis=1)
low_corr_lagged_features = corrs.index[mask]
dus_daily_filtered = dus_daily.drop(columns=low_corr_lagged_features)
corrs = pd.DataFrame(
{target: dus_daily_other.corrwith(Y[target]) for target in Y.columns}
).abs()
mask = (corrs < 0.03).all(axis=1)
low_corr_other_features = corrs.index[mask]
dus_daily_filtered = dus_daily_filtered.drop(columns=low_corr_other_features)
dus_daily_filtered = dus_daily_filtered.dropna()
print(
f"{len(low_corr_lagged_features)} lagged features were removed, namely:"
f"\n{low_corr_lagged_features.values},\nand {len(low_corr_other_features)} "
f"other features were deleted, namely:\n{low_corr_other_features.values}."
)
65 lagged features were removed, namely: ['DGS1_t-1' 'DGS1MO_t-5' 'DGS1_t-5' 'DGS1_t-10' 'DGS1MO_t-15' 'DGS3MO_t-15' 'DGS6MO_t-15' 'DGS1_t-15' 'DGS1MO_t-20' 'DGS3MO_t-20' 'DGS6MO_t-20' 'DGS1_t-20' 'DGS1MO_t-25' 'DGS3MO_t-25' 'DGS6MO_t-25' 'DGS1_t-25' 'DGS6MO_t-30' 'DGS1_t-30' 'DGS1_t-40' 'DGS1MO_t-50' 'DGS3MO_t-50' 'DGS6MO_t-50' 'DGS1_t-50' 'DGS2_t-1' 'DGS3_t-1' 'DGS5_t-1' 'DGS7_t-1' 'DGS10_t-1' 'DGS20_t-1' 'DGS30_t-1' 'DGS2_t-2' 'DGS3_t-2' 'DGS5_t-2' 'DGS7_t-2' 'DGS10_t-2' 'DGS20_t-2' 'DGS30_t-2' 'DGS2_t-10' 'DGS3_t-10' 'DGS5_t-10' 'DGS7_t-10' 'DGS10_t-10' 'DGS20_t-10' 'DGS30_t-10' 'DGS2_t-15' 'DGS3_t-15' 'DGS5_t-15' 'DGS7_t-15' 'DGS10_t-15' 'DGS20_t-15' 'DGS30_t-15' 'DGS2_t-25' 'DGS3_t-25' 'DGS5_t-25' 'DGS7_t-25' 'DGS10_t-25' 'DGS20_t-25' 'DGS30_t-25' 'DGS2_t-50' 'DGS3_t-50' 'DGS5_t-50' 'DGS7_t-50' 'DGS10_t-50' 'DGS20_t-50' 'DGS30_t-50'], and 26 other features were deleted, namely: ['RPI' 'INDPRO' 'UNRATE' 'UEMP5TO14' 'PAYEMS' 'SRVPRD' 'USTPU' 'USTRADE' 'USGOVT' 'HOUST' 'M1SL' 'NONREVSL' 'DTCOLNVHFNM' 'DTCTHFNM' 'CUUR0000SAD' 'FGEXPND' 'FGDEF' 'FYGFDPUN' 'GFDEBTN' 'DEXUSEU' 'DEXJPUS' 'DEXUSUK' 'VIXCLS' 'DCOILWTICO' 'CURRCIR' 'WSHOTSL'].
Let's plot the correlation heatmap once again now that the dataframe is partly filtered.
plt.figure(figsize=(25, 25))
sns.heatmap(dus_daily_filtered.corr(), cmap="seismic", center=0)
Let's analyze the results of a PCA on this dataset, setting the threshold to 95% of explained variance.
scaler = StandardScaler()
X = dus_daily_filtered.drop(columns=Y.columns)
X_scaled = scaler.fit_transform(X)
pca = PCA(n_components=0.95, random_state=SEED)
X_pca = pca.fit_transform(X_scaled)
print(
f"The cumulative explained variance is:\n{pca.explained_variance_ratio_.cumsum()}"
)
print(
f"It follows that {pca.n_components_} components are kept to explain 0.95 of the variance."
)
The cumulative explained variance is: [0.20063178 0.30765502 0.37692745 0.43118364 0.47970699 0.52083965 0.55851157 0.59130551 0.62189081 0.64949084 0.67587116 0.70135488 0.72535087 0.74670924 0.76544079 0.78283976 0.79778514 0.81148186 0.82468399 0.83765874 0.85015094 0.86229613 0.87367327 0.88420911 0.89411451 0.90398253 0.91344715 0.920782 0.92783535 0.93410597 0.93977142 0.94526377 0.9503372 ] It follows that 33 components are kept to explain 0.95 of the variance.
Let's visualize the top features' contributions to the first two principal components of our daily dataset.
pc1, pc2 = np.abs(pca.components_[:2])
importance = pc1 + pc2
n = 30
top_idx = np.argsort(importance)[-n:]
top_labels = X.columns[top_idx]
top_components = pca.components_[:2, top_idx]
plt.figure(figsize=(8, 8))
circle = plt.Circle((0, 0), 0.5, color="gray", fill=False)
plt.gca().add_artist(circle)
for i, (x, y) in enumerate(zip(top_components[0, :], top_components[1, :])):
plt.arrow(0, 0, x, y, color="r", alpha=0.6, head_width=0.02)
plt.text(
x * 1.15,
y * 1.15,
top_labels[i],
color="b",
ha="center",
va="center",
fontsize=9,
)
plt.xlim(-0.5, 0.5)
plt.ylim(-0.5, 0.5)
plt.xlabel("PC1")
plt.ylabel("PC2")
plt.title("Correlation Circle (top features)")
plt.grid()
plt.axhline(0, color="black", linewidth=0.5)
plt.axvline(0, color="black", linewidth=0.5)
plt.show()
Analysis of Principal Components & Feature Correlation
The correlation circle plot above visualizes the loadings of the 30 most significant features onto the first two principal components (PC1 and PC2), which together capture the largest sources of variance in our dataset.
The plot provides clear visual evidence of significant multicollinearity. We can observe several distinct, tightly-grouped clusters of variables:
- Monetary Aggregates: Features like `M2REAL`, `M2SL`, and `BOGMBASE` point in the same direction, indicating they are highly correlated and capture the same underlying "monetary base" factor.
- Foreign Yields: The 10-year yields of other major OECD countries (e.g., `IRLTLT01DEM156N`, `IRLTLT01JPM156N`) form another strong cluster.
- Lagged Domestic Yields: A large group of lagged yield features are bundled together, confirming the high serial correlation.
- Inflation & Employment Components: Sectoral breakdowns of inflation (e.g., `CPIAUCSL`, `CPIAPPSL`) and employment (e.g., `USWTRADE`, `USTRADE`) form their own respective clusters.
Given this high degree of collinearity, training a model on the raw features could lead to unstable coefficients and overfitting. Therefore, applying PCA as a preprocessing step is a well-justified method to reduce dimensionality and transform our correlated features into a more robust, orthogonal set of components.
3.2.2. Weekly Data
dus_weekly = dus_weekly.dropna()
dus_weekly.head()
| RPI | INDPRO | UNRATE | UEMP5TO14 | PAYEMS | USGOOD | USCONS | MANEMP | DMANEMP | NDMANEMP | ... | DCOILWTICO | FEDFUNDS | CURRCIR | BOGMBASE | WSHOSHO | WSHOTSL | T5YIE | T10YIE | XAU_LogReturns | SP500_LogReturns | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2003-04-04 | 0.004111 | -0.006278 | 0.016807 | 0.035964 | -0.000484 | -0.003744 | 0.005246 | -0.007706 | -0.008612 | -0.006242 | ... | -0.022277 | 0.007968 | 0.005315 | 0.006078 | 0.000906 | 0.000906 | -0.01 | -0.01 | 0.000983 | 0.002735 |
| 2003-04-11 | 0.004111 | -0.006278 | 0.016807 | 0.035964 | -0.000484 | -0.003744 | 0.005246 | -0.007706 | -0.008612 | -0.006242 | ... | 0.038938 | 0.007968 | 0.005315 | 0.006078 | 0.005599 | 0.005600 | -0.02 | -0.02 | 0.005132 | -0.003770 |
| 2003-04-18 | 0.004111 | -0.006278 | 0.016807 | 0.035964 | -0.000484 | -0.003744 | 0.005246 | -0.007706 | -0.008612 | -0.006242 | ... | 0.031727 | 0.007968 | 0.005315 | 0.006078 | 0.002649 | 0.002649 | 0.00 | 0.00 | 0.002295 | 0.015416 |
| 2003-04-25 | 0.004111 | -0.006278 | 0.016807 | 0.035964 | -0.000484 | -0.003744 | 0.005246 | -0.007706 | -0.008612 | -0.006242 | ... | -0.059898 | 0.007968 | 0.005315 | 0.006078 | 0.000164 | 0.000164 | 0.00 | 0.02 | -0.003593 | -0.013943 |
| 2003-05-02 | 0.007335 | -0.000344 | 0.016529 | -0.011338 | 0.000184 | -0.001327 | 0.002538 | -0.003429 | -0.003777 | -0.002866 | ... | -0.011972 | 0.000000 | 0.005032 | 0.005761 | 0.000587 | 0.000587 | 0.04 | 0.00 | -0.002050 | 0.014927 |
5 rows × 78 columns
To enrich our feature set, we are engineering a comprehensive list of endogenous features based on the target's own history. This function generates several categories of predictors:
- Rolling Statistics: Captures the target's recent distribution (mean, std, range, and quantiles) over multiple windows (10, 20, and 50 weeks).
- Momentum & Value Signals: Creates normalized indicators like a rolling Z-score, a fast/slow moving-average ratio (10-week vs. 50-week), and simple momentum.
- Raw Historical Lags: A sparse set of lagged values (e.g., t-1, t-2, t-5, ... t-50).
def create_lagged_features(df, columns, windows=[10, 20, 50]):
"""
Creates rolling-statistic, momentum, and raw-lag features from the
original yield series. All features are shifted by one step to
prevent look-ahead bias.
"""
all_features = {}
for col in columns:
y = df[col]
# Rolling statistics (calculates up to t)
for w in windows:
roll = y.rolling(w)
all_features[f"{col}_mean_{w}"] = roll.mean()
all_features[f"{col}_std_{w}"] = roll.std()
all_features[f"{col}_q25_{w}"] = roll.quantile(0.25)
all_features[f"{col}_q75_{w}"] = roll.quantile(0.75)
all_features[f"{col}_q05_{w}"] = roll.quantile(0.05)
all_features[f"{col}_q90_{w}"] = roll.quantile(0.9)
all_features[f"{col}_range_{w}"] = roll.max() - roll.min()
# Z-score and momentum (both use y[t])
all_features[f"{col}_zscore_{w}"] = (y - roll.mean()) / roll.std()
all_features[f"{col}_momentum_{w}"] = y.diff(w) # y[t] - y[t-w]
# Ratio of fast / slow moving averages
all_features[f"{col}_ratio_10_50"] = y.rolling(10).mean() / y.rolling(50).mean()
# Approximate annualized volatility
all_features[f"{col}_vol_20"] = y.rolling(20).std() * np.sqrt(52)
# Raw lags (starts from y[t])
# NOTE: The loop now starts at lag=0
for lag in [0, 1, 2, 3, 4, 5, 10, 15, 20, 25, 30, 40, 50]:
all_features[f"{col}_lag_{lag}"] = y.shift(lag)
features = pd.DataFrame(all_features, index=df.index)
# This makes all features at time 't' based on data from 't-1' or earlier.
features_lagged = features.shift(1)
return features_lagged
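A quick sanity check of the anti-leakage pattern above, on a toy series rather than the notebook's data: computing features "up to t" and then shifting the whole frame by one step means the row at time t only sees observations from t-1 and earlier.

```python
import numpy as np
import pandas as pd

# Toy stand-in for a yield column.
y = pd.Series(np.arange(10, dtype=float),
              index=pd.date_range("2024-01-05", periods=10, freq="W-FRI"))

# Same two-step pattern as the function above: features up to t, then shift(1).
feats = pd.DataFrame({"lag_0": y, "mean_3": y.rolling(3).mean()}).shift(1)

# The "lag 0" feature at row t is the observation from t-1, not t.
assert feats["lag_0"].iloc[5] == y.iloc[4]
print(feats.head(4))
```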
yield_cols = [
"DGS1MO",
"DGS3MO",
"DGS6MO",
"DGS1",
"DGS2",
"DGS3",
"DGS5",
"DGS7",
"DGS10",
"DGS20",
"DGS30",
]
# Create TARGETS (Y)
# Y at time 't' is the value from 't+1'
Y_weekly = pd.DataFrame(index=dus_weekly.index)
for col in yield_cols:
Y_weekly[f"Y_{col}"] = dus_weekly[col].shift(-1)
Y_weekly.head()
| Y_DGS1MO | Y_DGS3MO | Y_DGS6MO | Y_DGS1 | Y_DGS2 | Y_DGS3 | Y_DGS5 | Y_DGS7 | Y_DGS10 | Y_DGS20 | Y_DGS30 | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 2003-04-04 | 0.00 | 0.03 | 0.03 | 0.05 | 0.05 | 0.06 | 0.06 | 0.03 | 0.05 | 0.02 | 0.02 |
| 2003-04-11 | -0.01 | 0.01 | 0.00 | 0.03 | 0.04 | 0.05 | 0.04 | 0.02 | 0.02 | -0.01 | -0.01 |
| 2003-04-18 | 0.00 | -0.01 | -0.01 | -0.02 | -0.05 | -0.05 | -0.04 | -0.03 | -0.02 | -0.01 | -0.01 |
| 2003-04-25 | -0.02 | 0.02 | 0.03 | 0.06 | 0.06 | 0.07 | 0.08 | 0.08 | 0.06 | 0.04 | 0.04 |
| 2003-05-02 | 0.01 | 0.02 | 0.00 | -0.02 | -0.04 | -0.01 | -0.01 | 0.00 | -0.01 | -0.01 | -0.01 |
# Create your FEATURES (X)
dus_weekly_features = create_lagged_features(dus_weekly, yield_cols)
dus_weekly_features.sample(5)
| DGS1MO_mean_10 | DGS1MO_std_10 | DGS1MO_q25_10 | DGS1MO_q75_10 | DGS1MO_q05_10 | DGS1MO_q90_10 | DGS1MO_range_10 | DGS1MO_zscore_10 | DGS1MO_momentum_10 | DGS1MO_mean_20 | ... | DGS30_lag_3 | DGS30_lag_4 | DGS30_lag_5 | DGS30_lag_10 | DGS30_lag_15 | DGS30_lag_20 | DGS30_lag_25 | DGS30_lag_30 | DGS30_lag_40 | DGS30_lag_50 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2018-04-27 | 0.008 | 0.023476 | -0.0075 | 0.0175 | -0.0200 | 0.041 | 0.07 | -1.192720e+00 | -0.01 | 0.0040 | ... | -0.04 | 0.00 | 0.03 | 0.00 | 0.02 | -0.07 | -0.03 | 0.00 | -0.01 | -0.01 |
| 2020-07-17 | -0.002 | 0.006325 | -0.0075 | 0.0000 | -0.0100 | 0.001 | 0.02 | -1.264911e+00 | -0.01 | -0.0145 | ... | 0.00 | 0.04 | 0.07 | -0.01 | -0.13 | -0.07 | 0.03 | -0.06 | -0.03 | -0.01 |
| 2008-10-17 | -0.005 | 0.191848 | -0.1050 | -0.0075 | -0.1710 | 0.139 | 0.67 | -5.994337e-01 | -0.09 | -0.0280 | ... | 0.22 | 0.12 | 0.00 | -0.02 | -0.09 | -0.06 | -0.03 | -0.12 | -0.01 | 0.02 |
| 2019-01-25 | -0.010 | 0.019437 | -0.0175 | 0.0000 | -0.0400 | 0.011 | 0.06 | -1.133485e-14 | -0.01 | -0.0055 | ... | -0.01 | 0.01 | -0.02 | -0.03 | 0.05 | 0.02 | -0.01 | 0.00 | -0.02 | 0.07 |
| 2009-08-14 | -0.003 | 0.010593 | -0.0100 | 0.0000 | -0.0155 | 0.002 | 0.04 | -6.607826e-01 | -0.01 | -0.0005 | ... | 0.08 | -0.11 | -0.02 | -0.20 | 0.09 | 0.03 | 0.21 | 0.00 | 0.05 | 0.00 |
5 rows × 462 columns
# Align data
# This merges X[t-1] with Y[t] (which is the yield at t+1)
# The .dropna() removes:
# a) NaNs from the start of the features (from rolling/lagging)
# b) The final row(s) of Y_targets (which are NaN from the .shift(-1))
dus_weekly_augmented = (
dus_weekly.merge(
right=dus_weekly_features, how="left", left_index=True, right_index=True
)
.merge(right=Y_weekly, how="left", left_index=True, right_index=True)
.dropna()
)
dus_weekly_augmented.head()
| RPI | INDPRO | UNRATE | UEMP5TO14 | PAYEMS | USGOOD | USCONS | MANEMP | DMANEMP | NDMANEMP | ... | Y_DGS3MO | Y_DGS6MO | Y_DGS1 | Y_DGS2 | Y_DGS3 | Y_DGS5 | Y_DGS7 | Y_DGS10 | Y_DGS20 | Y_DGS30 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2004-03-26 | 0.003610 | -0.003912 | 0.035091 | -0.000821 | 0.002435 | 0.002992 | 0.007140 | 0.000560 | 0.000902 | 0.000000 | ... | 0.02 | 0.01 | 0.07 | 0.21 | 0.27 | 0.28 | 0.28 | 0.24 | 0.20 | 0.12 |
| 2004-04-02 | 0.003221 | 0.003871 | -0.035091 | -0.014067 | 0.002080 | 0.002020 | 0.002031 | 0.001958 | 0.002926 | 0.000369 | ... | -0.02 | 0.00 | 0.01 | 0.01 | 0.01 | 0.03 | 0.02 | 0.02 | 0.02 | -0.01 |
| 2004-04-09 | 0.003221 | 0.003871 | -0.035091 | -0.014067 | 0.002080 | 0.002020 | 0.002031 | 0.001958 | 0.002926 | 0.000369 | ... | -0.01 | -0.04 | -0.06 | -0.07 | -0.06 | -0.06 | -0.06 | -0.05 | -0.02 | -0.02 |
| 2004-04-16 | 0.003221 | 0.003871 | -0.035091 | -0.014067 | 0.002080 | 0.002020 | 0.002031 | 0.001958 | 0.002926 | 0.000369 | ... | 0.01 | 0.06 | 0.10 | 0.14 | 0.14 | 0.12 | 0.12 | 0.08 | 0.07 | 0.05 |
| 2004-04-23 | 0.003221 | 0.003871 | -0.035091 | -0.014067 | 0.002080 | 0.002020 | 0.002031 | 0.001958 | 0.002926 | 0.000369 | ... | 0.01 | 0.02 | 0.00 | -0.03 | -0.02 | -0.03 | -0.03 | -0.02 | -0.02 | -0.02 |
5 rows × 551 columns
dus_weekly_augmented = dus_weekly_augmented.replace([np.inf, -np.inf], np.nan)
dus_weekly_augmented = dus_weekly_augmented.ffill()
dus_weekly_augmented.head()
| RPI | INDPRO | UNRATE | UEMP5TO14 | PAYEMS | USGOOD | USCONS | MANEMP | DMANEMP | NDMANEMP | ... | Y_DGS3MO | Y_DGS6MO | Y_DGS1 | Y_DGS2 | Y_DGS3 | Y_DGS5 | Y_DGS7 | Y_DGS10 | Y_DGS20 | Y_DGS30 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2004-03-26 | 0.003610 | -0.003912 | 0.035091 | -0.000821 | 0.002435 | 0.002992 | 0.007140 | 0.000560 | 0.000902 | 0.000000 | ... | 0.02 | 0.01 | 0.07 | 0.21 | 0.27 | 0.28 | 0.28 | 0.24 | 0.20 | 0.12 |
| 2004-04-02 | 0.003221 | 0.003871 | -0.035091 | -0.014067 | 0.002080 | 0.002020 | 0.002031 | 0.001958 | 0.002926 | 0.000369 | ... | -0.02 | 0.00 | 0.01 | 0.01 | 0.01 | 0.03 | 0.02 | 0.02 | 0.02 | -0.01 |
| 2004-04-09 | 0.003221 | 0.003871 | -0.035091 | -0.014067 | 0.002080 | 0.002020 | 0.002031 | 0.001958 | 0.002926 | 0.000369 | ... | -0.01 | -0.04 | -0.06 | -0.07 | -0.06 | -0.06 | -0.06 | -0.05 | -0.02 | -0.02 |
| 2004-04-16 | 0.003221 | 0.003871 | -0.035091 | -0.014067 | 0.002080 | 0.002020 | 0.002031 | 0.001958 | 0.002926 | 0.000369 | ... | 0.01 | 0.06 | 0.10 | 0.14 | 0.14 | 0.12 | 0.12 | 0.08 | 0.07 | 0.05 |
| 2004-04-23 | 0.003221 | 0.003871 | -0.035091 | -0.014067 | 0.002080 | 0.002020 | 0.002031 | 0.001958 | 0.002926 | 0.000369 | ... | 0.01 | 0.02 | 0.00 | -0.03 | -0.02 | -0.03 | -0.03 | -0.02 | -0.02 | -0.02 |
5 rows × 551 columns
To reduce dimensionality and noise, we are removing features with a weak linear relationship to the target. All predictors with an absolute correlation coefficient below 0.1 are filtered out from the feature set.
# Create weekly features dataset by dropping the target columns
Xw = dus_weekly_augmented.drop(columns=Y_weekly.columns)
# Align target dataset index with Xw
Y_weekly = Y_weekly.reindex(Xw.index)
Y_weekly.head()
| Y_DGS1MO | Y_DGS3MO | Y_DGS6MO | Y_DGS1 | Y_DGS2 | Y_DGS3 | Y_DGS5 | Y_DGS7 | Y_DGS10 | Y_DGS20 | Y_DGS30 | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 2004-03-26 | -0.02 | 0.02 | 0.01 | 0.07 | 0.21 | 0.27 | 0.28 | 0.28 | 0.24 | 0.20 | 0.12 |
| 2004-04-02 | -0.01 | -0.02 | 0.00 | 0.01 | 0.01 | 0.01 | 0.03 | 0.02 | 0.02 | 0.02 | -0.01 |
| 2004-04-09 | -0.01 | -0.01 | -0.04 | -0.06 | -0.07 | -0.06 | -0.06 | -0.06 | -0.05 | -0.02 | -0.02 |
| 2004-04-16 | 0.06 | 0.01 | 0.06 | 0.10 | 0.14 | 0.14 | 0.12 | 0.12 | 0.08 | 0.07 | 0.05 |
| 2004-04-23 | -0.02 | 0.01 | 0.02 | 0.00 | -0.03 | -0.02 | -0.03 | -0.03 | -0.02 | -0.02 | -0.02 |
corrs = pd.DataFrame(
{target: Xw.corrwith(Y_weekly[target]) for target in Y_weekly.columns}
).abs()
mask = (corrs < 0.1).all(axis=1)
low_corr_features = corrs.index[mask]
Xw_filtered = Xw.drop(columns=low_corr_features)
print(f"{len(low_corr_features)} features were deleted, namely:\n{low_corr_features}")
440 features were deleted, namely:
Index(['RPI', 'INDPRO', 'UNRATE', 'UEMP5TO14', 'PAYEMS', 'USGOOD', 'USCONS',
'MANEMP', 'DMANEMP', 'NDMANEMP',
...
'DGS30_lag_3', 'DGS30_lag_4', 'DGS30_lag_5', 'DGS30_lag_10',
'DGS30_lag_15', 'DGS30_lag_20', 'DGS30_lag_25', 'DGS30_lag_30',
'DGS30_lag_40', 'DGS30_lag_50'],
dtype='object', length=440)
plt.figure(figsize=(25, 25))
sns.heatmap(Xw_filtered.corr(), cmap="seismic", center=0)
3.2.3. Weekly Data for Binary Classification
Since we will also train binary classification models, let's prepare the corresponding targets. Note that the classification targets are directions: does the yield go up or down from one time point to the next?
Y_weekly
| Y_DGS1MO | Y_DGS3MO | Y_DGS6MO | Y_DGS1 | Y_DGS2 | Y_DGS3 | Y_DGS5 | Y_DGS7 | Y_DGS10 | Y_DGS20 | Y_DGS30 | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 2004-03-26 | -0.02 | 0.02 | 0.01 | 0.07 | 0.21 | 0.27 | 0.28 | 0.28 | 0.24 | 0.20 | 0.12 |
| 2004-04-02 | -0.01 | -0.02 | 0.00 | 0.01 | 0.01 | 0.01 | 0.03 | 0.02 | 0.02 | 0.02 | -0.01 |
| 2004-04-09 | -0.01 | -0.01 | -0.04 | -0.06 | -0.07 | -0.06 | -0.06 | -0.06 | -0.05 | -0.02 | -0.02 |
| 2004-04-16 | 0.06 | 0.01 | 0.06 | 0.10 | 0.14 | 0.14 | 0.12 | 0.12 | 0.08 | 0.07 | 0.05 |
| 2004-04-23 | -0.02 | 0.01 | 0.02 | 0.00 | -0.03 | -0.02 | -0.03 | -0.03 | -0.02 | -0.02 | -0.02 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2024-11-29 | -0.02 | -0.04 | -0.04 | -0.04 | -0.05 | -0.05 | -0.04 | -0.03 | -0.02 | -0.01 | 0.01 |
| 2024-12-06 | 0.00 | -0.01 | 0.01 | 0.02 | 0.07 | 0.07 | 0.07 | 0.08 | 0.08 | 0.07 | 0.06 |
| 2024-12-13 | 0.01 | -0.01 | 0.00 | -0.01 | -0.02 | -0.03 | -0.06 | -0.06 | -0.05 | -0.03 | -0.02 |
| 2024-12-20 | -0.01 | -0.04 | -0.02 | -0.03 | 0.01 | 0.01 | 0.03 | 0.04 | 0.04 | 0.06 | 0.06 |
| 2024-12-27 | -0.03 | 0.00 | -0.01 | -0.01 | 0.01 | -0.02 | 0.01 | 0.02 | 0.03 | 0.02 | 0.01 |
1084 rows × 11 columns
Yw_binary = (Y_weekly > 0).astype(int)
Xw = dus_weekly_augmented.drop(columns=Yw_binary.columns)
Our feature engineering process on the yield time series resulted in a high-dimensional dataset. To mitigate the risk of overfitting and the curse of dimensionality, a robust feature selection step is essential. We will employ a filter-based method using Mutual Information (MI), which can capture non-linear relationships that simple correlation misses.
To determine the optimal feature set, we will create several distinct subsets of the data. Each subset will contain only those features with an MI score above a specific threshold. We will test a range of thresholdsβnamely 0.03, 0.035, 0.04, and 0.05βand train our predictive models on each of these datasets to evaluate which level of parsimony yields the best out-of-sample performance.
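To see why MI is preferred here, consider a toy example (our own construction) where the label depends on a feature only through its magnitude: the Pearson correlation is near zero, yet `mutual_info_classif` still ranks the informative feature well above pure noise.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif

rng = np.random.default_rng(42)
x = rng.normal(size=2000)
noise = rng.normal(size=2000)
label = (np.abs(x) > 1).astype(int)  # depends on x, but symmetrically

mi = mutual_info_classif(np.column_stack([x, noise]), label, random_state=42)
corr = abs(np.corrcoef(x, label)[0, 1])

print(f"|corr(x, label)| = {corr:.3f}")  # close to 0
print(f"MI(x), MI(noise) = {mi[0]:.3f}, {mi[1]:.3f}")  # MI(x) clearly larger
```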
initial_number_of_features = Xw.shape[1]
threshold_list = [0.03, 0.035, 0.04, 0.05]
datasets = {t: pd.DataFrame() for t in threshold_list}
for threshold_mi in tqdm(threshold_list):
selected_features_w = set()
for col in Yw_binary.columns:
mi = mutual_info_classif(Xw, Yw_binary[col], random_state=SEED)
top_features_i = Xw.columns[mi > threshold_mi]
selected_features_w.update(top_features_i)
datasets[threshold_mi] = Xw[list(selected_features_w)]
print(
f"{initial_number_of_features - len(selected_features_w)} features were removed for threshold {threshold_mi}."
)
print(
f"So that {datasets[threshold_mi].shape[1]} features are left for that threshold {threshold_mi}."
)
print(
f"Features that are left, bar target related ones, are\
\n{[col for col in datasets[threshold_mi].columns if not col.startswith(('Y', 'DGS'))]}."
)
25%|██▌       | 1/4 [00:28<01:25, 28.38s/it]
292 features were removed for threshold 0.03. So that 248 features are left for that threshold 0.03. Features that are left, bar target related ones, are ['UEMP5TO14', 'CPIAPPSL', 'IRLTLT01AUM156N', 'WSHOTSL', 'DTCOLNVHFNM', 'USGOOD', 'UNRATE', 'FEDFUNDS', 'DMANEMP', 'BOGMBASE', 'USCONS', 'GFDEBTN', 'USWTRADE', 'INDPRO', 'CUUR0000SAD', 'USGOVT', 'GFDEGDQ188S', 'IRLTLT01DEM156N', 'USTRADE', 'CPIULFSL', 'M2REAL', 'PCEPI', 'M1SL', 'IRLTLT01CAM156N', 'TOTRESNS', 'MANEMP', 'DNDGRG3M086SBEA', 'PERMIT', 'M2SL', 'DCOILWTICO', 'FYGFDPUN', 'FGEXPND', 'RPI', 'HOUST', 'USTPU', 'NDMANEMP', 'MTSDS133FMS', 'CUUR0000SA0L2', 'CURRCIR', 'CPIAUCSL', 'UMCSENT', 'CUSR0000SAC', 'SRVPRD', 'NONREVSL', 'PAYEMS', 'CPITRNSL', 'FGDEF', 'USFIRE', 'IRLTLT01GBM156N', 'DTCTHFNM'].
50%|█████     | 2/4 [00:58<00:58, 29.44s/it]
355 features were removed for threshold 0.035. So that 185 features are left for that threshold 0.035. Features that are left, bar target related ones, are ['UEMP5TO14', 'CPIAPPSL', 'IRLTLT01AUM156N', 'DTCOLNVHFNM', 'USGOOD', 'UNRATE', 'FEDFUNDS', 'DMANEMP', 'BOGMBASE', 'USCONS', 'GFDEBTN', 'USWTRADE', 'INDPRO', 'CUUR0000SAD', 'USGOVT', 'GFDEGDQ188S', 'IRLTLT01DEM156N', 'USTRADE', 'CPIULFSL', 'M2REAL', 'PCEPI', 'M1SL', 'IRLTLT01CAM156N', 'TOTRESNS', 'MANEMP', 'DNDGRG3M086SBEA', 'PERMIT', 'M2SL', 'DCOILWTICO', 'FYGFDPUN', 'FGEXPND', 'RPI', 'HOUST', 'USTPU', 'NDMANEMP', 'MTSDS133FMS', 'CUUR0000SA0L2', 'CURRCIR', 'CPIAUCSL', 'UMCSENT', 'CUSR0000SAC', 'SRVPRD', 'NONREVSL', 'PAYEMS', 'CPITRNSL', 'FGDEF', 'USFIRE', 'IRLTLT01GBM156N', 'DTCTHFNM'].
75%|███████▌  | 3/4 [01:28<00:29, 29.49s/it]
401 features were removed for threshold 0.04. So that 139 features are left for that threshold 0.04. Features that are left, bar target related ones, are ['UEMP5TO14', 'CPIAPPSL', 'IRLTLT01AUM156N', 'DTCOLNVHFNM', 'USGOOD', 'UNRATE', 'FEDFUNDS', 'DMANEMP', 'BOGMBASE', 'USCONS', 'GFDEBTN', 'USWTRADE', 'INDPRO', 'CUUR0000SAD', 'USGOVT', 'GFDEGDQ188S', 'IRLTLT01DEM156N', 'USTRADE', 'CPIULFSL', 'M2REAL', 'PCEPI', 'M1SL', 'IRLTLT01CAM156N', 'TOTRESNS', 'MANEMP', 'DNDGRG3M086SBEA', 'PERMIT', 'M2SL', 'DCOILWTICO', 'FYGFDPUN', 'FGEXPND', 'RPI', 'HOUST', 'USTPU', 'NDMANEMP', 'MTSDS133FMS', 'CUUR0000SA0L2', 'CURRCIR', 'CPIAUCSL', 'UMCSENT', 'CUSR0000SAC', 'SRVPRD', 'NONREVSL', 'PAYEMS', 'CPITRNSL', 'FGDEF', 'USFIRE', 'IRLTLT01GBM156N', 'DTCTHFNM'].
100%|██████████| 4/4 [01:53<00:00, 28.47s/it]
459 features were removed for threshold 0.05. So that 81 features are left for that threshold 0.05. Features that are left, bar target related ones, are ['UEMP5TO14', 'CPIAPPSL', 'IRLTLT01AUM156N', 'DTCOLNVHFNM', 'USGOOD', 'FEDFUNDS', 'DMANEMP', 'BOGMBASE', 'USCONS', 'GFDEBTN', 'USWTRADE', 'INDPRO', 'CUUR0000SAD', 'USGOVT', 'GFDEGDQ188S', 'IRLTLT01DEM156N', 'USTRADE', 'CPIULFSL', 'M2REAL', 'PCEPI', 'M1SL', 'IRLTLT01CAM156N', 'TOTRESNS', 'MANEMP', 'DNDGRG3M086SBEA', 'PERMIT', 'M2SL', 'FYGFDPUN', 'FGEXPND', 'RPI', 'HOUST', 'USTPU', 'NDMANEMP', 'MTSDS133FMS', 'CUUR0000SA0L2', 'CURRCIR', 'CPIAUCSL', 'UMCSENT', 'CUSR0000SAC', 'SRVPRD', 'NONREVSL', 'PAYEMS', 'CPITRNSL', 'FGDEF', 'USFIRE', 'IRLTLT01GBM156N', 'DTCTHFNM'].
The feature selection process revealed that many of the macroeconomic and market variables sourced from FRED had low Mutual Information (MI) scores with our target. As a result, applying the MI threshold filtered out a significant portion of these exogenous features, suggesting they have a weak predictive relationship with yield directions.
4. Methodology and Model Design
4.1 Regressions on Daily Dataset
Methodology: Regularized Linear Models
To establish a robust baseline for our forecasts, we first implement a suite of regularized linear models. This model class is well-suited for high-dimensional, noisy financial data as it is designed to prevent overfitting and perform implicit feature selection.
We will test three classic regularization techniques:
- Ridge Regression (L2 penalty): Shrinks all feature coefficients toward zero, which is highly effective at handling multicollinearity.
- Lasso Regression (L1 penalty): Shrinks less important feature coefficients to be exactly zero, performing automatic feature selection.
- ElasticNet (L1 + L2 penalty): A hybrid model that combines the strengths of both Ridge and Lasso, useful when many features are correlated.
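A minimal sketch of the qualitative difference between the three penalties (a toy regression, not our yield data): Ridge keeps every coefficient nonzero, while Lasso and ElasticNet can zero out uninformative ones.

```python
import numpy as np
from sklearn.linear_model import ElasticNet, Lasso, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 10))
beta = np.array([2.0, -1.5] + [0.0] * 8)  # only 2 informative features
y = X @ beta + rng.normal(scale=0.5, size=300)

ridge = Ridge(alpha=1.0).fit(X, y)
lasso = Lasso(alpha=0.1).fit(X, y)
enet = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y)

print("Ridge nonzero coefs:", np.sum(ridge.coef_ != 0))            # all 10
print("Lasso nonzero coefs:", np.sum(np.abs(lasso.coef_) > 1e-8))  # far fewer
print("ENet  nonzero coefs:", np.sum(np.abs(enet.coef_) > 1e-8))
```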
Backtesting Framework
All models are evaluated using a rolling-window (walk-forward) validation to simulate a realistic out-of-sample trading environment. The framework is as follows:
- Window Sizing: A training window of 4 or 10 years (1008 or 2520 days) is used to predict the subsequent 21-day (1-month) forecast horizon.
- Rolling: The window is rolled forward by 21 days, and the process is repeated.
- Hyperparameter Tuning: To avoid data leakage, hyperparameters (`alpha` and `l1_ratio`) are tuned inside each training window. We use the `MultiTaskLassoCV`, `MultiTaskElasticNetCV`, and `RidgeCV` models, which perform this tuning efficiently using an inner `TimeSeriesSplit`.
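The slicing logic described above can be sketched with index arithmetic alone (the parameters mirror the configuration used below; `N` is a hypothetical sample length):

```python
TRAIN, HORIZON = 1008, 21  # 4-year window, 21-day forecast horizon
N = 1100                   # hypothetical number of daily observations

folds = []
for start in range(0, N - TRAIN - HORIZON + 1, HORIZON):
    train_idx = range(start, start + TRAIN)
    test_idx = range(start + TRAIN, start + TRAIN + HORIZON)
    folds.append((train_idx, test_idx))

# Every training window ends strictly before its test window begins,
# and consecutive test blocks tile the out-of-sample period with no overlap.
assert all(tr.stop == te.start for tr, te in folds)
assert all(folds[i][1].stop == folds[i + 1][1].start for i in range(len(folds) - 1))
print(len(folds), "folds; first test block starts at index", folds[0][1].start)
```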
Experimental Comparison
Our primary experiment for this section is to compare two distinct preprocessing pipelines for each model:
- Model (No PCA): `StandardScaler` $\rightarrow$ `Model`
- Model (With PCA): `StandardScaler` $\rightarrow$ `PCA(n_components=0.95)` $\rightarrow$ `Model`
This comparison will allow us to determine if explicitly handling multicollinearity and reducing dimensionality with PCA improves the out-of-sample performance and stability of the regularized models. Performance will be measured using OOS R-squared (R²), directional accuracy (hit rate), and the number of features selected by the Lasso/ElasticNet models.
4.1.1. Ridge Regression
# Define Backtest Configuration
# Define the rolling window parameters
TRAIN_WINDOW_SIZE = 252 * 4 # 4 years of daily data (1008)
FORECAST_HORIZON = 21 # 21 days (approx. 1 month)
# Define the RidgeCV hyperparameter grid
RIDGE_ALPHAS = np.logspace(-3, 3, 20)
# Define the inner-loop cross-validator for RidgeCV
INNER_LOOP_CV = TimeSeriesSplit(n_splits=4)
# Define Model Pipelines
model_pipelines = {
"Ridge_PCA": Pipeline(
[
("scaler", StandardScaler()),
("pca", PCA(n_components=0.95, random_state=SEED)), # Retain 95% of variance
(
"ridge",
MultiOutputRegressor(
RidgeCV(alphas=RIDGE_ALPHAS, cv=INNER_LOOP_CV, fit_intercept=False)
),
),
]
),
"Ridge_NoPCA": Pipeline(
[
("scaler", StandardScaler()),
(
"ridge",
MultiOutputRegressor(
RidgeCV(alphas=RIDGE_ALPHAS, cv=INNER_LOOP_CV, fit_intercept=False)
),
),
]
),
}
# Define Backtest Helper Function
def run_backtest_fold(model, X_train, Y_train, X_test, Y_test):
"""
Fits a model on the training fold, predicts the test fold,
and returns all relevant predictions and performance metrics.
"""
# Fit the model pipeline
model.fit(X_train, Y_train)
# Generate in-sample (IS) and out-of-sample (OOS) predictions
Y_pred_is = model.predict(X_train)
Y_pred_oos = model.predict(X_test)
# Calculate performance metrics
r2_is = r2_score(Y_train, Y_pred_is, multioutput="raw_values")
r2_oos = r2_score(Y_test, Y_pred_oos, multioutput="raw_values")
# Calculate directional accuracy (hit rate)
hit_rate = np.mean(np.sign(Y_test.values) == np.sign(Y_pred_oos), axis=0)
# Store results in a structured dictionary
fold_results = {
"r2_in_sample": r2_is,
"r2_out_of_sample": r2_oos,
"hit_rate": hit_rate,
"predictions": Y_pred_oos,
"target_dates": X_test.index,
}
return fold_results
# Execute the Rolling Window Backtest
# Initialize a nested dictionary to store all results
backtest_results = {model_name: [] for model_name in model_pipelines.keys()}
# Define the total number of steps in the backtest
total_steps = len(X) - TRAIN_WINDOW_SIZE - FORECAST_HORIZON + 1
for start_idx in tqdm(range(0, total_steps, FORECAST_HORIZON)):
# Define window indices
train_end_idx = start_idx + TRAIN_WINDOW_SIZE
test_end_idx = train_end_idx + FORECAST_HORIZON
# Slice data for the current fold
X_train = X.iloc[start_idx:train_end_idx]
Y_train = Y.iloc[start_idx:train_end_idx]
X_test = X.iloc[train_end_idx:test_end_idx]
Y_test = Y.iloc[train_end_idx:test_end_idx]
# Iterate through each defined model
for model_name, pipeline in model_pipelines.items():
# Run the evaluation for this fold
results = run_backtest_fold(pipeline, X_train, Y_train, X_test, Y_test)
backtest_results[model_name].append(results)
print("Backtest complete.")
100%|██████████| 210/210 [03:04<00:00, 1.14it/s]
Backtest complete.
# Helper Function to Process Backtest Results
def extract_metric_dataframe(
backtest_results, model_name, metric_key, target_columns=None
):
"""
Extracts a specific metric from the backtest_results dictionary
and returns it as a DataFrame.
Args:
backtest_results (dict): The main dictionary holding all fold results.
model_name (str): The key for the model (e.g., 'Ridge_PCA').
metric_key (str): The key for the metric (e.g., 'r2_out_of_sample').
target_columns (pd.Index): The column names of the original Y DataFrame.
Returns:
pd.DataFrame: A DataFrame where each row is a fold and columns are targets.
"""
results_list = backtest_results[model_name]
metric_data = [fold[metric_key] for fold in results_list]
fold_indices = [fold["target_dates"][0] for fold in results_list]
# Handle multi-output metrics (like R2) vs. single-value metrics (like alpha)
if target_columns is not None and isinstance(metric_data[0], (np.ndarray, list)):
df = pd.DataFrame(metric_data, columns=target_columns, index=fold_indices)
else:
# This handles single-value metrics (like alpha)
df = pd.DataFrame(metric_data, columns=[metric_key], index=fold_indices)
df.index.name = "Prediction_Start_Date"
return df
# Extract All Metrics into DataFrames
target_cols = Y.columns
# In-Sample R2
r2_is_df = extract_metric_dataframe(
backtest_results, "Ridge_PCA", "r2_in_sample", target_cols
)
r2_is_df_nopca = extract_metric_dataframe(
backtest_results, "Ridge_NoPCA", "r2_in_sample", target_cols
)
# Out-of-Sample R2
r2_os_df = extract_metric_dataframe(
backtest_results, "Ridge_PCA", "r2_out_of_sample", target_cols
)
r2_os_df_nopca = extract_metric_dataframe(
backtest_results, "Ridge_NoPCA", "r2_out_of_sample", target_cols
)
# Hit Rate
hr_df = extract_metric_dataframe(backtest_results, "Ridge_PCA", "hit_rate", target_cols)
hr_df_nopca = extract_metric_dataframe(
backtest_results, "Ridge_NoPCA", "hit_rate", target_cols
)
# Create the Visualization Grid
plt.close("all") # Close any existing figures
fig, axes = plt.subplots(3, 2, figsize=(28, 28))
plot_config = [
{
"df": r2_is_df,
"ax": axes[0, 0],
"title": "In-Sample R² per Fold (with PCA)",
"ylim": None,
},
{
"df": r2_is_df_nopca,
"ax": axes[0, 1],
"title": "In-Sample R² per Fold (no PCA)",
"ylim": None,
},
{
"df": r2_os_df,
"ax": axes[1, 0],
"title": "Out-of-Sample R² per Fold (with PCA)",
"ylim": (-0.3, 0.3),
},
{
"df": r2_os_df_nopca,
"ax": axes[1, 1],
"title": "Out-of-Sample R² per Fold (no PCA)",
"ylim": (-0.3, 0.3),
},
{
"df": hr_df,
"ax": axes[2, 0],
"title": "Out-of-Sample Hit Rate per Fold (with PCA)",
"ylim": None,
},
{
"df": hr_df_nopca,
"ax": axes[2, 1],
"title": "Out-of-Sample Hit Rate per Fold (no PCA)",
"ylim": None,
},
]
for config in plot_config:
config["df"].plot(ax=config["ax"], title=config["title"], legend=True)
config["ax"].set_xlabel("Prediction Start Date")
config["ax"].grid(True, axis="y", linestyle="--", linewidth=0.5)
if config["ylim"]:
config["ax"].set_ylim(config["ylim"])
fig.suptitle(
"Rolling Window Model Performance: PCA vs. No PCA (4-Year Window, 21-Day Horizon)",
fontsize=24,
y=1.02,
)
plt.tight_layout(rect=[0, 0.03, 1, 0.98])
plt.show()
Analysis of Rolling-Window Ridge Performance¶
Our analysis of the rolling-window backtest for the RidgeCV models (with and without PCA) yields several findings regarding in-sample fit and out-of-sample predictive power.
In-Sample Performance (R²)¶
The in-sample R-squared (IS R²) is low across all maturities, indicating that the selected features explain only a small fraction of the variance in yields. This explanatory power is not uniform across the curve: the model shows a modest fit for short-term yields (1-month to 1-year) that deteriorates significantly for long-term maturities.
Furthermore, the IS R² is highly unstable over time. We observe large swings in model fit across the rolling windows, particularly around 2011 and 2020, suggesting the model's coefficients are sensitive to outliers and structural breaks within the training data.
Out-of-Sample Performance (R² and Hit Rate)¶
The out-of-sample (OOS) R-squared provides the true test of predictive skill. Across all target maturities, the OOS R² is consistently near zero or negative. This signifies a complete lack of predictive power: the model fails to outperform a simple historical-mean forecast.
This failure is confirmed by the directional accuracy (hit rate), which averages approximately 0.5. That is statistically indistinguishable from a random guess (a coin flip), indicating the model has no discernible skill in predicting the direction of future yield changes.
Effect of PCA and Conclusion¶
Introducing PCA into the pipeline did not produce any meaningful change in the in-sample R². While the hit rate became slightly less variable with PCA, suggesting a minor reduction in model variance, the overall out-of-sample results were equally poor.
Conclusion: the combination of a modest in-sample fit with non-existent out-of-sample performance (negative OOS R² and a ~0.5 hit rate) points to a classic case of overfitting. The model fits noise within the training window and fails to generalize or capture any durable, predictive relationships.
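The benchmark implicit in a negative OOS R² is worth making concrete. A minimal, self-contained sketch with synthetic data (not the notebook's yields) shows that `r2_score` is roughly zero for a historical-mean forecast and goes negative as soon as a model adds its own noise on top:

```python
import numpy as np
from sklearn.metrics import r2_score

rng = np.random.default_rng(0)

# Hypothetical yield changes: pure noise, no predictable signal
y_test = rng.normal(0.0, 1.0, size=200)

# Benchmark: forecast the historical (training) mean, assumed here to be 0
mean_forecast = np.zeros_like(y_test)

# An "overfit" forecast adds unrelated noise on top of the mean
noisy_forecast = mean_forecast + rng.normal(0.0, 0.5, size=200)

print(r2_score(y_test, mean_forecast))   # close to 0
print(r2_score(y_test, noisy_forecast))  # negative
```

This is why an OOS R² hovering at or below zero is read as "no better than the mean" rather than merely "weak".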
4.1.2. Lasso Regression¶
# Define the LassoCV hyperparameter grid
LASSO_ALPHAS = np.logspace(-5, 2, 20)
INNER_LOOP_CV = TimeSeriesSplit(n_splits=4)
# Define Model Pipelines
model_pipelines = {
"Lasso_PCA": Pipeline(
[
("scaler", StandardScaler()),
("pca", PCA(n_components=0.95, random_state=SEED)),
(
"lasso",
MultiTaskLassoCV(
fit_intercept=False,
alphas=LASSO_ALPHAS,
cv=INNER_LOOP_CV,
max_iter=5000,
tol=1e-3,
random_state=SEED,
),
),
]
),
"Lasso_NoPCA": Pipeline(
[
("scaler", StandardScaler()),
(
"lasso",
MultiTaskLassoCV(
fit_intercept=False,
alphas=LASSO_ALPHAS,
cv=INNER_LOOP_CV,
max_iter=5000,
tol=1e-3,
random_state=SEED,
),
),
]
),
}
# Define Backtest Helper Function for Lasso Regression
def run_backtest_fold_lasso(model, X_train, Y_train, X_test, Y_test):
"""
Fits a Lasso pipeline on the training fold, predicts the test fold,
and returns all relevant predictions and performance metrics.
"""
# Fit the model pipeline
model.fit(X_train, Y_train)
# Generate in-sample (IS) and out-of-sample (OOS) predictions
Y_pred_is = model.predict(X_train)
Y_pred_oos = model.predict(X_test)
# Calculate Performance Metrics
r2_is = r2_score(Y_train, Y_pred_is, multioutput="raw_values")
r2_oos = r2_score(Y_test, Y_pred_oos, multioutput="raw_values")
hit_rate = np.mean(np.sign(Y_test.values) == np.sign(Y_pred_oos), axis=0)
# Extract Lasso-Specific Metrics
lasso_estimator = model.named_steps["lasso"]
coefs = lasso_estimator.coef_
# Count features that are non-zero for *at least one* target
n_features_selected = np.sum(np.any(coefs != 0, axis=0))
fold_results = {
"r2_in_sample": r2_is,
"r2_out_of_sample": r2_oos,
"hit_rate": hit_rate,
"selected_alpha": lasso_estimator.alpha_,
"n_features_selected": n_features_selected,
"predictions": Y_pred_oos,
"target_dates": X_test.index,
}
return fold_results
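The hit-rate line in the helper above compares signs elementwise. A tiny illustration with made-up numbers, including the edge case where a true change of exactly 0 can only be "hit" by a predicted 0:

```python
import numpy as np

y_true = np.array([0.12, -0.05, 0.00, 0.30])
y_pred = np.array([0.04, -0.10, 0.02, -0.01])

# A prediction counts as a hit only when np.sign matches exactly;
# sign(0) == 0, so a flat true change is never matched by a nonzero forecast
hits = np.sign(y_true) == np.sign(y_pred)
print(hits.tolist())         # [True, True, False, False]
print(float(np.mean(hits)))  # 0.5
```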
# Execute the Rolling Window Backtest
backtest_results = {model_name: [] for model_name in model_pipelines.keys()}
total_steps = len(X) - TRAIN_WINDOW_SIZE - FORECAST_HORIZON + 1
for start_idx in tqdm(range(0, total_steps, FORECAST_HORIZON)):
# Define window indices
train_end_idx = start_idx + TRAIN_WINDOW_SIZE
test_end_idx = train_end_idx + FORECAST_HORIZON
# Slice data for the current fold
X_train = X.iloc[start_idx:train_end_idx]
Y_train = Y.iloc[start_idx:train_end_idx]
X_test = X.iloc[train_end_idx:test_end_idx]
Y_test = Y.iloc[train_end_idx:test_end_idx]
# Iterate through each defined model
for model_name, pipeline in model_pipelines.items():
# Run the evaluation for this fold
results = run_backtest_fold_lasso(pipeline, X_train, Y_train, X_test, Y_test)
backtest_results[model_name].append(results)
print("Backtest complete.")
# Extract all 10 metric DataFrames
target_cols = Y.columns
r2_is_df = extract_metric_dataframe(
backtest_results, "Lasso_PCA", "r2_in_sample", target_cols
)
r2_is_df_nopca = extract_metric_dataframe(
backtest_results, "Lasso_NoPCA", "r2_in_sample", target_cols
)
r2_os_df = extract_metric_dataframe(
backtest_results, "Lasso_PCA", "r2_out_of_sample", target_cols
)
r2_os_df_nopca = extract_metric_dataframe(
backtest_results, "Lasso_NoPCA", "r2_out_of_sample", target_cols
)
hr_df = extract_metric_dataframe(backtest_results, "Lasso_PCA", "hit_rate", target_cols)
hr_df_nopca = extract_metric_dataframe(
backtest_results, "Lasso_NoPCA", "hit_rate", target_cols
)
fs_df = extract_metric_dataframe(backtest_results, "Lasso_PCA", "n_features_selected")
fs_df_nopca = extract_metric_dataframe(
backtest_results, "Lasso_NoPCA", "n_features_selected"
)
alpha_df = extract_metric_dataframe(backtest_results, "Lasso_PCA", "selected_alpha")
alpha_df_nopca = extract_metric_dataframe(
backtest_results, "Lasso_NoPCA", "selected_alpha"
)
# Create the Visualization Grid
plt.close("all")
fig, axes = plt.subplots(5, 2, figsize=(28, 35)) # Increased height for 5 rows
fig.suptitle(
"Rolling Window Lasso Performance: PCA vs. No PCA (4-Year Window, 21-Day Horizon)",
fontsize=24,
y=1.01,
)
plot_config = [
{
"df": r2_is_df,
"ax": axes[0, 0],
"title": "In-Sample RΒ² (with PCA)",
"ylim": None,
},
{
"df": r2_is_df_nopca,
"ax": axes[0, 1],
"title": "In-Sample RΒ² (no PCA)",
"ylim": None,
},
{
"df": r2_os_df,
"ax": axes[1, 0],
"title": "Out-of-Sample RΒ² (with PCA)",
"ylim": (-0.3, 0.3),
},
{
"df": r2_os_df_nopca,
"ax": axes[1, 1],
"title": "Out-of-Sample RΒ² (no PCA)",
"ylim": (-0.3, 0.3),
},
{
"df": hr_df,
"ax": axes[2, 0],
"title": "Out-of-Sample Hit Rate (with PCA)",
"ylim": None,
},
{
"df": hr_df_nopca,
"ax": axes[2, 1],
"title": "Out-of-Sample Hit Rate (no PCA)",
"ylim": None,
},
{
"df": fs_df,
"ax": axes[3, 0],
"title": "Selected Features (with PCA)",
"ylim": None,
"legend": False,
},
{
"df": fs_df_nopca,
"ax": axes[3, 1],
"title": "Selected Features (no PCA)",
"ylim": None,
"legend": False,
},
{
"df": alpha_df,
"ax": axes[4, 0],
"title": "Selected Alpha (with PCA)",
"ylim": None,
"legend": False,
},
{
"df": alpha_df_nopca,
"ax": axes[4, 1],
"title": "Selected Alpha (no PCA)",
"ylim": None,
"legend": False,
},
]
for config in plot_config:
show_legend = config.get("legend", True) # Show legend by default
config["df"].plot(ax=config["ax"], title=config["title"], legend=show_legend)
config["ax"].set_xlabel("Prediction Start Date")
config["ax"].grid(True, axis="y", linestyle="--", linewidth=0.5)
if config["ylim"]:
config["ax"].set_ylim(config["ylim"])
plt.tight_layout(rect=[0, 0.03, 1, 0.98])
plt.show()
The Lasso (L1) regression yielded results inferior to the Ridge (L2) model. This underperformance is directly attributable to the instability of its feature-selection mechanism: in numerous rolling windows, the MultiTaskLassoCV failed to converge on a useful subset of predictors, instead selecting a regularization strength (alpha) large enough to set every feature coefficient to zero.
This all-or-nothing sparsity is far more damaging in a low signal-to-noise environment than Ridge's L2 approach. Because Ridge only shrinks coefficients rather than zeroing them, it avoids this specific mode of model failure and provides a more consistent, non-zero forecast.
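The all-zero failure mode is easy to reproduce. A minimal sketch (synthetic data and a hypothetical fixed alpha, using `MultiTaskLasso` directly rather than the CV wrapper): with a sufficiently large penalty every coefficient is exactly zero, and with `fit_intercept=False` the model's forecast collapses to a constant zero.

```python
import numpy as np
from sklearn.linear_model import MultiTaskLasso

rng = np.random.default_rng(42)

# Features and targets that are, by construction, unrelated noise
X_demo = rng.normal(size=(200, 30))
Y_demo = rng.normal(size=(200, 3))

# A large alpha drives every coefficient to exactly zero
lasso = MultiTaskLasso(alpha=10.0, fit_intercept=False).fit(X_demo, Y_demo)

print(np.count_nonzero(lasso.coef_))        # 0: no features selected
print(np.allclose(lasso.predict(X_demo), 0.0))  # True: constant zero forecast
```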
4.1.3. ElasticNet Regression¶
# Define the ElasticNetCV hyperparameter grid
ELASTIC_NET_ALPHAS = np.logspace(-3, 2, 20)
L1_RATIOS = [0.25, 0.5, 0.75]
# Define Model Pipelines
model_pipelines = {
"ElasticNet_PCA": Pipeline(
[
("scaler", StandardScaler()),
("pca", PCA(n_components=0.95, random_state=SEED)),
(
"elasticnet",
MultiTaskElasticNetCV(
alphas=ELASTIC_NET_ALPHAS,
cv=INNER_LOOP_CV,
l1_ratio=L1_RATIOS,
fit_intercept=False,
max_iter=5000,
random_state=SEED,
),
),
]
),
"ElasticNet_NoPCA": Pipeline(
[
("scaler", StandardScaler()),
(
"elasticnet",
MultiTaskElasticNetCV(
alphas=ELASTIC_NET_ALPHAS,
cv=INNER_LOOP_CV,
l1_ratio=L1_RATIOS,
fit_intercept=False,
max_iter=5000,
random_state=SEED,
),
),
]
),
}
# Define Backtest Helper Function
def run_backtest_fold_elasticnet(model, X_train, Y_train, X_test, Y_test):
"""
Fits an ElasticNet pipeline on the training fold, predicts the test fold,
and returns all relevant predictions and performance metrics.
"""
model.fit(X_train, Y_train)
# Generate in-sample (IS) and out-of-sample (OOS) predictions
Y_pred_is = model.predict(X_train)
Y_pred_oos = model.predict(X_test)
# Calculate Performance Metrics
r2_is = r2_score(Y_train, Y_pred_is, multioutput="raw_values")
r2_oos = r2_score(Y_test, Y_pred_oos, multioutput="raw_values")
hit_rate = np.mean(np.sign(Y_test.values) == np.sign(Y_pred_oos), axis=0)
# Extract ElasticNet-Specific Metrics
elasticnet_estimator = model.named_steps["elasticnet"]
coefs = elasticnet_estimator.coef_
# Count features that are non-zero for *at least one* target
n_features_selected = np.sum(np.any(coefs != 0, axis=0))
fold_results = {
"r2_in_sample": r2_is,
"r2_out_of_sample": r2_oos,
"hit_rate": hit_rate,
"selected_alpha": elasticnet_estimator.alpha_,
"selected_l1_ratio": elasticnet_estimator.l1_ratio_,
"n_features_selected": n_features_selected,
"predictions": Y_pred_oos,
"target_dates": X_test.index,
}
return fold_results
# Execute the Rolling Window Backtest
backtest_results = {model_name: [] for model_name in model_pipelines.keys()}
# Define the total number of steps in the backtest
total_steps = len(X) - TRAIN_WINDOW_SIZE - FORECAST_HORIZON + 1
for start_idx in tqdm(range(0, total_steps, FORECAST_HORIZON)):
# Define window indices
train_end_idx = start_idx + TRAIN_WINDOW_SIZE
test_end_idx = train_end_idx + FORECAST_HORIZON
# Slice data for the current fold
X_train = X.iloc[start_idx:train_end_idx]
Y_train = Y.iloc[start_idx:train_end_idx]
X_test = X.iloc[train_end_idx:test_end_idx]
Y_test = Y.iloc[train_end_idx:test_end_idx]
# Iterate through each defined model
for model_name, pipeline in model_pipelines.items():
# Run the evaluation for this fold
results = run_backtest_fold_elasticnet(
pipeline, X_train, Y_train, X_test, Y_test
)
backtest_results[model_name].append(results)
print("Backtest complete.")
100%|██████████| 210/210 [12:05<00:00, 3.46s/it]
Backtest complete.
# Extract all 12 metric DataFrames
target_cols = Y.columns
r2_is_df = extract_metric_dataframe(
backtest_results, "ElasticNet_PCA", "r2_in_sample", target_cols
)
r2_is_df_nopca = extract_metric_dataframe(
backtest_results, "ElasticNet_NoPCA", "r2_in_sample", target_cols
)
r2_os_df = extract_metric_dataframe(
backtest_results, "ElasticNet_PCA", "r2_out_of_sample", target_cols
)
r2_os_df_nopca = extract_metric_dataframe(
backtest_results, "ElasticNet_NoPCA", "r2_out_of_sample", target_cols
)
hr_df = extract_metric_dataframe(
backtest_results, "ElasticNet_PCA", "hit_rate", target_cols
)
hr_df_nopca = extract_metric_dataframe(
backtest_results, "ElasticNet_NoPCA", "hit_rate", target_cols
)
fs_df = extract_metric_dataframe(
backtest_results, "ElasticNet_PCA", "n_features_selected"
)
fs_df_nopca = extract_metric_dataframe(
backtest_results, "ElasticNet_NoPCA", "n_features_selected"
)
alpha_df = extract_metric_dataframe(
backtest_results, "ElasticNet_PCA", "selected_alpha"
)
alpha_df_nopca = extract_metric_dataframe(
backtest_results, "ElasticNet_NoPCA", "selected_alpha"
)
l1_df = extract_metric_dataframe(
backtest_results, "ElasticNet_PCA", "selected_l1_ratio"
)
l1_df_nopca = extract_metric_dataframe(
backtest_results, "ElasticNet_NoPCA", "selected_l1_ratio"
)
# Create the Visualization Grid
plt.close("all")
fig, axes = plt.subplots(6, 2, figsize=(28, 42)) # Increased height for 6 rows
fig.suptitle(
"Rolling Window ElasticNet Performance: PCA vs. No PCA (10-Year Window, 21-Day Horizon)",
fontsize=24,
y=1.01,
)
plot_config = [
{
"df": r2_is_df,
"ax": axes[0, 0],
"title": "In-Sample RΒ² (with PCA)",
"ylim": None,
},
{
"df": r2_is_df_nopca,
"ax": axes[0, 1],
"title": "In-Sample RΒ² (no PCA)",
"ylim": None,
},
{
"df": r2_os_df,
"ax": axes[1, 0],
"title": "Out-of-Sample RΒ² (with PCA)",
"ylim": (-0.3, 0.3),
},
{
"df": r2_os_df_nopca,
"ax": axes[1, 1],
"title": "Out-of-Sample RΒ² (no PCA)",
"ylim": (-0.3, 0.3),
},
{
"df": hr_df,
"ax": axes[2, 0],
"title": "Out-of-Sample Hit Rate (with PCA)",
"ylim": None,
},
{
"df": hr_df_nopca,
"ax": axes[2, 1],
"title": "Out-of-Sample Hit Rate (no PCA)",
"ylim": None,
},
{
"df": fs_df,
"ax": axes[3, 0],
"title": "Selected Features (with PCA)",
"ylim": None,
"legend": False,
},
{
"df": fs_df_nopca,
"ax": axes[3, 1],
"title": "Selected Features (no PCA)",
"ylim": None,
"legend": False,
},
{
"df": alpha_df,
"ax": axes[4, 0],
"title": "Selected Alpha (with PCA)",
"ylim": None,
"legend": False,
},
{
"df": alpha_df_nopca,
"ax": axes[4, 1],
"title": "Selected Alpha (no PCA)",
"ylim": None,
"legend": False,
},
{
"df": l1_df,
"ax": axes[5, 0],
"title": "Selected L1 Ratio (with PCA)",
"ylim": None,
"legend": False,
},
{
"df": l1_df_nopca,
"ax": axes[5, 1],
"title": "Selected L1 Ratio (no PCA)",
"ylim": None,
"legend": False,
},
]
for config in plot_config:
show_legend = config.get("legend", True) # Show legend by default
config["df"].plot(ax=config["ax"], title=config["title"], legend=show_legend)
config["ax"].set_xlabel("Prediction Start Date")
config["ax"].grid(True, axis="y", linestyle="--", linewidth=0.5)
if config["ylim"]:
config["ax"].set_ylim(config["ylim"])
plt.tight_layout(rect=[0, 0.03, 1, 0.98])
plt.show()
Analysis of Rolling-Window ElasticNet Performance¶
This grid of plots compares the pipeline with PCA (left column) against the pipeline without PCA (right column). Both models fail to demonstrate any statistically significant out-of-sample (OOS) predictive power and are severely overfit to the in-sample data.
Observations:¶
Out-of-Sample R² (Row 2):
- For both the PCA and No-PCA models, the OOS R² is highly volatile, centered around 0.0, and frequently negative. A negative OOS R² means the model's forecast is worse than a simple historical mean, a definitive sign of an overfit model with no generalizable, out-of-sample skill.
In-Sample R² (Row 1):
- The in-sample R² is consistently low (mostly between 0.0 and 0.15), indicating that the features explain very little of the variance even within the training window.
- A low-but-positive in-sample R² paired with a zero or negative out-of-sample R² is the classic signature of a model fitting noise rather than a durable signal.
Hit Rate (Row 3):
- This plot confirms the OOS R² finding. The directional accuracy of both models is extremely volatile and averages around 0.5 (50%).
- A 0.5 hit rate is equivalent to a random coin flip, confirming the models have no discernible skill in predicting the direction of yield changes.
Model Stability: Selected Features (Row 4) & Alpha (Row 5):
- The model is extremely unstable. The number of selected features (Row 4) is very volatile, jumping from near zero to over 30 from one window to the next, and the selected alpha (Row 5) oscillates between its minimum and maximum values.
In short, both ElasticNet models, with or without PCA, overfit the data and fail to produce valid out-of-sample forecasts. The instability of the selected features and hyperparameters (alpha) in the non-PCA model strongly suggests that the features themselves have no consistent, predictive linear relationship with the target yields over the 21-day horizon.
4.1.4. Conclusion from Daily Regression Models¶
Our backtests of the regularized linear models (Ridge, Lasso, and ElasticNet) on the daily dataset show no generalizable, out-of-sample (OOS) predictive power, as evidenced by:
- OOS R-Squared: The out-of-sample R² is consistently centered around zero and is frequently negative, indicating our models' forecasts are statistically no better (and often worse) than a simple historical mean.
- Hit Rate (Directional Accuracy): The hit rate for all models averages approximately 0.5, which is statistically indistinguishable from a random coin flip.
The instability and overfitting observed suggest that the daily time series is too noisy. Attempting to fit more complex, non-linear models (such as deep neural networks) to this noisy daily data would likely increase overfitting and is unlikely to yield a robust model.
4.2. Classification on the Weekly Dataset¶
Methodology Introduction¶
Our analysis of the daily regression models demonstrated a severe lack of out-of-sample predictive power, with models consistently failing to outperform a simple benchmark. This suggests the daily data is dominated by high-frequency noise, and predicting the exact magnitude of yield changes is an overly complex task.
Based on these findings, we pivot our methodology in two fundamental ways to test a more tractable hypothesis:
- Problem Simplification (Regression $\rightarrow$ Classification): We move from regression to binary classification. Instead of predicting the value of the future yield, we now only predict its direction (1 for "Up," 0 for "Down/Flat"). This simplifies the problem, potentially making the underlying signal more detectable.
- Temporal Aggregation (Daily $\rightarrow$ Weekly): We aggregate our data to a weekly frequency. This acts as a natural filter, smoothing out the daily market noise and allowing the lower-frequency economic and financial signals to become more prominent.
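The binary target construction is not shown in this section (the notebook's `Yw_binary` is built from the real weekly data elsewhere), but one minimal way to encode "1 for Up, 0 for Down/Flat" from a DataFrame of weekly yield changes is a strict-inequality threshold, which sends flat weeks to 0:

```python
import pandas as pd

# Hypothetical weekly yield changes for one maturity (column name assumed)
dates = pd.date_range("2000-01-07", periods=8, freq="W-FRI")
yield_changes = pd.DataFrame(
    {"DGS2": [0.05, -0.02, 0.0, 0.01, -0.03, 0.04, 0.0, -0.01]}, index=dates
)

# 1 for "Up", 0 for "Down/Flat" -- strict inequality maps exact zeros to 0
Yw_binary_demo = (yield_changes > 0).astype(int)
print(Yw_binary_demo["DGS2"].tolist())  # [1, 0, 0, 1, 0, 1, 0, 0]
```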
Experimental Framework¶
We will evaluate the performance of four distinct classification models on this new weekly task:
- Logistic Regression (with L2-CV)
- Random Forest
- XGBoost
- Long Short-Term Memory (LSTM) Network
The models are trained and evaluated using the same rigorous rolling-window framework as in the previous section: a 15-year (780-week) training window is used to predict the subsequent 4 weeks, after which the window is rolled forward.
Feature Selection: Mutual Information & PCA¶
To handle our high-dimensional feature set, we employ a filter method based on Mutual Information (MI), which excels at capturing non-linear relationships. We will train our models on several distinct datasets, each created by filtering features based on a different MI threshold (e.g., 0.03, 0.035, 0.04, and 0.05) to find the optimal feature set.
A special case is made for the MI > 0.03 dataset. This set contains 266 features, which is a very high dimensionality relative to our 780-week training window. This "curse of dimensionality" can lead to model instability and overfitting. Therefore, for this specific dataset, we will test two variants of each model:
- Standard: The model trained on all 266 features.
- PCA Variant: A model trained on principal components, with PCA(n_components=0.99) applied within the pipeline to reduce dimensionality and multicollinearity.
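The `threshold_list` and `datasets` objects used in the loops below are built elsewhere in the notebook; a minimal sketch of the MI filter on synthetic stand-ins (all names with `_demo` suffixes are hypothetical) shows the mechanics — score each feature against a binary target, then keep one column subset per threshold:

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import mutual_info_classif

rng = np.random.default_rng(0)

# Hypothetical stand-ins for the weekly features and one binary target
Xw_demo = pd.DataFrame(rng.normal(size=(300, 12)),
                       columns=[f"f{i}" for i in range(12)])
y_demo = (Xw_demo["f0"] + 0.5 * rng.normal(size=300) > 0).astype(int)

# MI of each feature with the target (captures non-linear dependence too)
mi = pd.Series(mutual_info_classif(Xw_demo, y_demo, random_state=0),
               index=Xw_demo.columns)

# One filtered dataset per threshold, mirroring the notebook's `datasets` dict
threshold_list_demo = [0.03, 0.035, 0.04, 0.05]
datasets_demo = {t: Xw_demo.loc[:, mi > t] for t in threshold_list_demo}
print({t: df.shape[1] for t, df in datasets_demo.items()})
```

Higher thresholds can only shrink the feature set, which is why the 0.03 dataset is the largest and the only one paired with a PCA variant.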
# Define Backtest Configuration
# Define the rolling window parameters
TRAIN_WINDOW_SIZE = 52 * 15 # 15 years of weekly data
FORECAST_HORIZON = 4 # Predict 4 weeks ahead
ROLL_STEP_SIZE = 4 # Roll the window forward by the forecast horizon
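The configuration above fixes how many rolling folds the backtest produces. A quick sanity check of the arithmetic, using a hypothetical sample length of about 25 years of weekly observations (`n_obs` is an assumption, not the notebook's actual count):

```python
import math

n_obs = 52 * 25         # hypothetical: 1300 weekly observations
train_window = 52 * 15  # TRAIN_WINDOW_SIZE above
horizon = 4             # FORECAST_HORIZON (also the roll step)

# Same arithmetic as the rolling loops: range(0, total_steps, horizon)
total_steps = n_obs - train_window - horizon + 1
n_folds = math.ceil(total_steps / horizon)
print(total_steps, n_folds)  # 517 130
```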
4.2.1. Logistic Regression¶
# Define Hyperparameter Grid for LogisticRegressionCV
LOGREG_CS_GRID = np.logspace(-5, 2, 30)
# Define the inner-loop cross-validator for hyperparameter tuning.
INNER_LOOP_CV = TimeSeriesSplit(n_splits=4)
# Define Helper: Backtest Function
def run_rolling_logreg_backtest(pipeline, X_data, Y_data, train_window, pred_window):
"""
Runs a full rolling-window backtest for a given Logistic Regression pipeline.
Returns tuple of DataFrames:
(accuracies, best_Cs, feature_coefs, y_predictions, y_true_all)
"""
# Lists to store results from each fold
accuracies_list = []
best_C_list = []
feature_coefs_list = []
y_pred_list = []
y_true_list = []
total_steps = len(X_data) - train_window - pred_window + 1
for start in tqdm(range(0, total_steps, pred_window)):
end_train = start + train_window
end_pred = end_train + pred_window
X_train = X_data.iloc[start:end_train]
Y_train = Y_data.iloc[start:end_train]
X_test = X_data.iloc[end_train:end_pred]
Y_test = Y_data.iloc[end_train:end_pred]
y_true_list.append(Y_test)
# Fit model
pipeline.fit(X_train, Y_train)
Y_pred = pipeline.predict(X_test)
y_pred_list.append(Y_pred)
# Initialize dicts to store metrics for this fold
fold_accuracies = {}
fold_best_C = {}
fold_feature_coefs = {}
# Extract metrics for each target (maturity)
for i, col in enumerate(Y_train.columns):
fold_accuracies[col] = accuracy_score(Y_test[col], Y_pred[:, i])
# Get the fitted estimator for this specific target
estimator = pipeline.named_steps["logreg"].estimators_[i]
fold_best_C[col] = estimator.C_[0]
fold_feature_coefs[col] = np.abs(estimator.coef_).flatten()
accuracies_list.append(fold_accuracies)
best_C_list.append(fold_best_C)
feature_coefs_list.append(fold_feature_coefs)
# Aggregate and Format Results
# Concatenate all true Y values to get the correct full index
y_true_df = pd.concat(y_true_list)
# Get the start date of each fold for the results DataFrames
fold_start_dates = y_true_df.index[::pred_window]
# Create DataFrames
acc_df = pd.DataFrame(
accuracies_list, columns=Y_data.columns, index=fold_start_dates
)
params_df = pd.DataFrame(
best_C_list, columns=Y_data.columns, index=fold_start_dates
)
feature_imp_df = pd.DataFrame(
feature_coefs_list, columns=Y_data.columns, index=fold_start_dates
)
# Create aligned prediction/true DataFrames
y_pred_df = pd.DataFrame(
np.concatenate(y_pred_list), index=y_true_df.index, columns=Y_data.columns
)
return acc_df, params_df, feature_imp_df, y_pred_df, y_true_df
# Define Helper: Save Results Function
def save_results(file_prefix, acc, params, feature_imp, ypred, ytrue):
"""Saves all result DataFrames to parquet files."""
print(f"Saving results for prefix: {file_prefix}")
acc.to_parquet(f"model_results/accuracies_{file_prefix}.parquet")
params.to_parquet(f"model_results/params_{file_prefix}.parquet")
feature_imp.to_parquet(f"model_results/feature_importance_{file_prefix}.parquet")
ypred.to_parquet(f"model_results/forecast_{file_prefix}.parquet")
ytrue.to_parquet(f"model_results/true_values_{file_prefix}.parquet")
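save_results writes into a relative model_results/ directory, and to_parquet does not create missing parent directories, so a run on a fresh machine would typically fail with a FileNotFoundError. A small guard (path as used above) avoids that:

```python
import os

# No-op if the directory already exists; creates it otherwise
os.makedirs("model_results", exist_ok=True)
```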
# Main Backtest Loop
for threshold in threshold_list:
print(f"\n--- Training on Mutual Info Threshold: {threshold} ---")
Xw = datasets[threshold]
pipe_nopca = Pipeline(
[
("scaler", StandardScaler()),
(
"logreg",
MultiOutputClassifier(
LogisticRegressionCV(
penalty="l2",
cv=INNER_LOOP_CV,
Cs=LOGREG_CS_GRID,
fit_intercept=False,
scoring="accuracy",
max_iter=5000,
random_state=SEED,
)
),
),
]
)
# Run the standard (no PCA) model for this threshold
(acc_l2, params_l2, f_imp_l2, ypred_l2, ytrue_l2) = run_rolling_logreg_backtest(
pipe_nopca, Xw, Yw_binary, TRAIN_WINDOW_SIZE, FORECAST_HORIZON
)
# Save the results
file_prefix = f"logreg_MI_{threshold}"
save_results(file_prefix, acc_l2, params_l2, f_imp_l2, ypred_l2, ytrue_l2)
print(f"Average accuracy for MI > {threshold} (No PCA):")
print(acc_l2.mean())
# For the 0.03 threshold, also run the PCA variant
if threshold == 0.03:
print(f"--- Running special PCA case for MI Threshold: {threshold} ---")
pipe_pca = Pipeline(
[
("scaler", StandardScaler()),
("pca", PCA(n_components=0.99, random_state=SEED)),
(
"logreg",
MultiOutputClassifier(
LogisticRegressionCV(
penalty="l2",
cv=INNER_LOOP_CV,
Cs=LOGREG_CS_GRID,
fit_intercept=False,
scoring="accuracy",
max_iter=5000,
random_state=SEED,
)
),
),
]
)
(acc_pca, params_pca, f_imp_pca, ypred_pca, ytrue_pca) = (
run_rolling_logreg_backtest(
pipe_pca, Xw, Yw_binary, TRAIN_WINDOW_SIZE, FORECAST_HORIZON
)
)
# Save PCA results
file_prefix_pca = "logreg_pca_MI_0.03"
save_results(
file_prefix_pca, acc_pca, params_pca, f_imp_pca, ypred_pca, ytrue_pca
)
print(f"Average accuracy for MI > {threshold} (WITH PCA):")
print(acc_pca.mean())
print("All backtests complete.")
Our preliminary results for the logistic regression models appear to show a modest improvement over the daily regression task. While the daily models consistently failed to show out-of-sample (OOS) skill (OOS R² $\approx$ 0 and hit rates $\approx$ 0.5), our weekly classification models seem to be achieving accuracies above the 50% random-guess baseline.
Although these results require deeper analysis, this initial finding suggests our methodological pivot may be correct:
- Classification (predicting direction) may be a more tractable problem than regression (predicting magnitude).
- Weekly data likely provides a better signal-to-noise ratio than daily data.
Given that this potential signal is visible even with a simple linear classifier (Logistic Regression), we are optimistic that more powerful, non-linear models (Random Forest, XGBoost, and LSTM) may be able to better capture the complex patterns in the data. We will now proceed to test these models on the same framework.
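Whether an accuracy above 50% is distinguishable from chance can be checked with a one-sided binomial test. A sketch with hypothetical counts (the real tallies come from the saved accuracy DataFrames, not these numbers):

```python
from scipy.stats import binomtest

# Hypothetical tally: 460 correct directional calls out of 840 predictions
result = binomtest(k=460, n=840, p=0.5, alternative="greater")
print(f"hit rate = {460 / 840:.3f}, p-value = {result.pvalue:.4f}")
```

One caveat: the test assumes independent trials, and overlapping 4-week forecasts are serially correlated, so p-values computed this way are optimistic and should be read as a rough screen rather than a formal result.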
4.2.2. XGBoost¶
# Parameters to test in cross-validation
param_grid = {
"xgb__estimator__n_estimators": [100, 200],
"xgb__estimator__learning_rate": [0.01, 0.05],
"xgb__estimator__max_depth": [4, 7],
"xgb__estimator__subsample": [0.5, 0.7],
"xgb__estimator__colsample_bytree": [0.4, 0.8],
"xgb__estimator__min_child_weight": [5, 10],
"xgb__estimator__reg_alpha": [0.5, 1.0],
"xgb__estimator__reg_lambda": [5, 10],
}
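This grid is much heavier than it looks: eight hyperparameters with two candidates each multiply out to 2^8 = 256 fits per inner CV split of every rolling fold, which is why this backtest is by far the slowest. A quick count using sklearn's ParameterGrid (with the prefix-free keys, purely for illustration):

```python
from sklearn.model_selection import ParameterGrid

# Same shape as the grid above: 8 parameters x 2 values each
demo_grid = {
    "n_estimators": [100, 200],
    "learning_rate": [0.01, 0.05],
    "max_depth": [4, 7],
    "subsample": [0.5, 0.7],
    "colsample_bytree": [0.4, 0.8],
    "min_child_weight": [5, 10],
    "reg_alpha": [0.5, 1.0],
    "reg_lambda": [5, 10],
}
print(len(ParameterGrid(demo_grid)))  # 256
```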
# Define Base Model Pipelines
# Pipeline without PCA
pipe = Pipeline(
[
("scaler", StandardScaler()),
(
"xgb",
MultiOutputClassifier(
XGBClassifier(
objective="binary:logistic",
use_label_encoder=False,
n_jobs=-1, # n_jobs in XGBClassifier is for tree-building
random_state=SEED,
verbosity=0,
)
),
),
]
)
# Pipeline with PCA
pipepca = Pipeline(
[
("scaler", StandardScaler()),
("pca", PCA(n_components=0.99, random_state=SEED)),
(
"xgb",
MultiOutputClassifier(
XGBClassifier(
objective="binary:logistic",
use_label_encoder=False,
n_jobs=-1,
random_state=SEED,
verbosity=0,
)
),
),
]
)
# Define Helper: Backtest Function
def run_rolling_xgb_backtest(
pipeline, X_data, Y_data, param_grid, tscv, train_window, pred_window
):
"""
Runs a full rolling-window backtest for a given XGBoost pipeline
using GridSearchCV inside each loop.
Returns tuple of DataFrames:
(accuracies, parameters, feature_importances, y_predictions, y_true_all)
"""
# Lists to store results from each fold
accuracies_list = []
params_list = []
feature_imp_list = []
y_pred_list = []
y_true_list = []
total_steps = len(X_data) - train_window - pred_window + 1
for start in tqdm(range(0, total_steps, pred_window)):
end_train = start + train_window
end_pred = end_train + pred_window
Xw_train = X_data.iloc[start:end_train]
Yw_train = Y_data.iloc[start:end_train]
Xw_test = X_data.iloc[end_train:end_pred]
Yw_test = Y_data.iloc[end_train:end_pred]
y_true_list.append(Yw_test)
# Run the (expensive) GridSearchCV on the training data
grid = GridSearchCV(
pipeline, param_grid, cv=tscv, scoring="accuracy", n_jobs=-1, verbose=0
)
grid.fit(Xw_train, Yw_train)
best_model = grid.best_estimator_
# Predict and store
Yw_pred = best_model.predict(Xw_test)
y_pred_list.append(Yw_pred)
# Initialize dicts to store metrics for this fold
fold_accuracies = {}
fold_params = {}
fold_feature_imp = {}
# Extract metrics for each target (maturity)
for i, col in enumerate(Yw_train.columns):
fold_accuracies[col] = accuracy_score(Yw_test[col], Yw_pred[:, i])
# Get the fitted estimator for this specific target
estimator = best_model.named_steps["xgb"].estimators_[i]
fold_params[col] = estimator.get_params()
fold_feature_imp[col] = estimator.feature_importances_
accuracies_list.append(fold_accuracies)
params_list.append(fold_params)
feature_imp_list.append(fold_feature_imp)
# Aggregate and Format Results
# Concatenate all true Y values to get the correct full index
ytrue_df = pd.concat(y_true_list)
# Get the start date of each fold for the results DataFrames
fold_start_dates = ytrue_df.index[::pred_window]
acc_df = pd.DataFrame(
accuracies_list, columns=Y_data.columns, index=fold_start_dates
)
params_df = pd.DataFrame(
params_list, columns=Y_data.columns, index=fold_start_dates
)
feature_imp_df = pd.DataFrame(
feature_imp_list, columns=Y_data.columns, index=fold_start_dates
)
ypred_df = pd.DataFrame(
np.concatenate(y_pred_list), index=ytrue_df.index, columns=Y_data.columns
)
return acc_df, params_df, feature_imp_df, ypred_df, ytrue_df
# Define the inner-loop cross-validator for hyperparameter tuning.
INNER_LOOP_CV = TimeSeriesSplit(n_splits=4)
# Main Backtest Loop
for threshold in threshold_list:
print(f"\n--- Training on Mutual Info Threshold: {threshold} ---")
Xw = datasets[threshold]
if threshold == 0.03:
# --- Run No-PCA Model ---
print("Running No-PCA model for threshold 0.03...")
(acc, params, f_imp, ypred, ytrue) = run_rolling_xgb_backtest(
pipe,
Xw,
Yw_binary,
param_grid,
INNER_LOOP_CV,
TRAIN_WINDOW_SIZE,
FORECAST_HORIZON,
)
print(f"Average accuracy (No PCA, MI > {threshold}):")
print(acc.mean())
save_results("xgboost, 15y train test", acc, params, f_imp, ypred, ytrue)
# --- Run PCA Model ---
print("Running PCA model for threshold 0.03...")
(acc_pca, params_pca, f_imp_pca, ypred_pca, ytrue_pca) = (
run_rolling_xgb_backtest(
pipepca,
Xw,
Yw_binary,
param_grid,
INNER_LOOP_CV,
TRAIN_WINDOW_SIZE,
FORECAST_HORIZON,
)
)
print(f"Average accuracy (WITH PCA, MI > {threshold}):")
print(acc_pca.mean())
save_results(
"xgboost, pca, 15y train test",
acc_pca,
params_pca,
f_imp_pca,
ypred_pca,
ytrue_pca,
)
else:
# --- Run No-PCA Model for other thresholds ---
print(f"Running No-PCA model for threshold {threshold}...")
(acc, params, f_imp, ypred, ytrue) = run_rolling_xgb_backtest(
pipe,
Xw,
Yw_binary,
param_grid,
INNER_LOOP_CV,
TRAIN_WINDOW_SIZE,
FORECAST_HORIZON,
)
print(f"Average accuracy (No PCA, MI > {threshold}):")
print(acc.mean())
save_results(
f"xgboost, MI {threshold}, 15y train test", acc, params, f_imp, ypred, ytrue
)
print("All backtests complete.")
4.2.3. Random Forest¶
# Parameters to test in cross-validation
param_grid = {
"rf__estimator__n_estimators": [150],
"rf__estimator__max_depth": [4, 7],
"rf__estimator__min_samples_leaf": [5, 10],
}
# Define Base Model Pipelines
# Pipeline without PCA
pipe_nopca = Pipeline(
[
("scaler", StandardScaler()),
(
"rf",
MultiOutputClassifier(RandomForestClassifier(n_jobs=-1, random_state=SEED)),
),
]
)
# Pipeline with PCA
pipe_pca = Pipeline(
[
("scaler", StandardScaler()),
("pca", PCA(n_components=0.99, random_state=SEED)),
(
"rf",
MultiOutputClassifier(RandomForestClassifier(n_jobs=-1, random_state=SEED)),
),
]
)
# Define Helper: Backtest Function
def run_rolling_gridsearch_backtest(
pipeline, X_data, Y_data, param_grid, tscv, train_window, pred_window
):
"""
Runs a full rolling-window backtest for a given pipeline
using GridSearchCV inside each loop.
Returns tuple of DataFrames:
(accuracies, parameters, feature_importances, y_predictions, y_true_all)
"""
# Lists to store results from each fold
accuracies_list = []
params_list = []
feature_imp_list = []
y_pred_list = []
y_true_list = []
total_steps = len(X_data) - train_window - pred_window + 1
for start in tqdm(range(0, total_steps, pred_window)):
end_train = start + train_window
end_pred = end_train + pred_window
Xw_train = X_data.iloc[start:end_train]
Yw_train = Y_data.iloc[start:end_train]
Xw_test = X_data.iloc[end_train:end_pred]
Yw_test = Y_data.iloc[end_train:end_pred]
y_true_list.append(Yw_test)
grid = GridSearchCV(
pipeline, param_grid, cv=tscv, scoring="accuracy", n_jobs=-1, verbose=0
)
grid.fit(Xw_train, Yw_train)
best_model = grid.best_estimator_
# Predict and store
Yw_pred = best_model.predict(Xw_test)
y_pred_list.append(Yw_pred)
# Initialize dicts to store metrics for this fold
fold_accuracies = {}
fold_params = {}
fold_feature_imp = {}
# Extract metrics for each target (maturity)
for i, col in enumerate(Yw_train.columns):
fold_accuracies[col] = accuracy_score(Yw_test[col], Yw_pred[:, i])
# Get the fitted estimator for this specific target
estimator = best_model.named_steps["rf"].estimators_[i]
fold_params[col] = estimator.get_params()
# Note: For PCA, this will be importance on the components,
# not the original features.
fold_feature_imp[col] = estimator.feature_importances_
accuracies_list.append(fold_accuracies)
params_list.append(fold_params)
feature_imp_list.append(fold_feature_imp)
# --- Aggregate and Format Results ---
ytrue_df = pd.concat(y_true_list)
fold_start_dates = ytrue_df.index[::pred_window]
acc_df = pd.DataFrame(
accuracies_list, columns=Y_data.columns, index=fold_start_dates
)
params_df = pd.DataFrame(
params_list, columns=Y_data.columns, index=fold_start_dates
)
feature_imp_df = pd.DataFrame(
feature_imp_list, columns=Y_data.columns, index=fold_start_dates
)
# Create aligned prediction/true DataFrames
ypred_df = pd.DataFrame(
np.concatenate(y_pred_list), index=ytrue_df.index, columns=Y_data.columns
)
return acc_df, params_df, feature_imp_df, ypred_df, ytrue_df
# Define Helper: Save Results Function
def save_results_rf(file_prefix, acc, params, feature_imp, ypred, ytrue):
"""Saves all result DataFrames to parquet files."""
print(f"Saving results for prefix: {file_prefix}")
base_path = "model_results/"
acc.to_parquet(f"{base_path}accuracies_{file_prefix}.parquet")
params.to_parquet(f"{base_path}params_{file_prefix}.parquet")
feature_imp.to_parquet(f"{base_path}feature_importance_{file_prefix}.parquet")
ypred.to_parquet(f"{base_path}forecast_{file_prefix}.parquet")
ytrue.to_parquet(f"{base_path}true_values_{file_prefix}.parquet")
# Main Backtest Loop
for threshold in threshold_list:
print(f"\n--- Training on Mutual Info Threshold: {threshold} ---")
Xw = datasets[threshold]
if threshold == 0.03:
# --- Run No-PCA Model ---
print("Running No-PCA model for threshold 0.03...")
(acc, params, f_imp, ypred, ytrue) = run_rolling_gridsearch_backtest(
pipe_nopca,
Xw,
Yw_binary,
param_grid,
INNER_LOOP_CV,
TRAIN_WINDOW_SIZE,
FORECAST_HORIZON,
)
print(f"Average accuracy (No PCA, MI > {threshold}):")
print(acc.mean())
save_results_rf("random forest, 15y train test", acc, params, f_imp, ypred, ytrue)
# --- Run PCA Model ---
print("Running PCA model for threshold 0.03...")
(acc_pca, params_pca, f_imp_pca, ypred_pca, ytrue_pca) = (
run_rolling_gridsearch_backtest(
pipe_pca,
Xw,
Yw_binary,
param_grid,
INNER_LOOP_CV,
TRAIN_WINDOW_SIZE,
FORECAST_HORIZON,
)
)
print(f"Average accuracy (WITH PCA, MI > {threshold}):")
print(acc_pca.mean())
save_results_rf(
"random forest, pca, 15y train test",
acc_pca,
params_pca,
f_imp_pca,
ypred_pca,
ytrue_pca,
)
else:
# --- Run No-PCA Model for other thresholds ---
print(f"Running No-PCA model for threshold {threshold}...")
(acc, params, f_imp, ypred, ytrue) = run_rolling_gridsearch_backtest(
pipe_nopca,
Xw,
Yw_binary,
param_grid,
INNER_LOOP_CV,
TRAIN_WINDOW_SIZE,
FORECAST_HORIZON,
)
print(f"Average accuracy (No PCA, MI > {threshold}):")
print(acc.mean())
save_results_rf(
f"random forest, MI {threshold}, 15y train test",
acc,
params,
f_imp,
ypred,
ytrue,
)
print("All backtests complete.")
4.2.4. Long Short-Term Memory Recurrent Neural Network (LSTM)¶
4.2.4.1. LSTM Framework¶
For our final model, the Long Short-Term Memory (LSTM) network, we focus exclusively on the richest and most complex feature set: the MI > 0.02 dataset. This dataset has not been used for any of the other models.
Our rationale for this is twofold. First, we want to test the primary theoretical advantage of deep learning: an LSTM, with its complex recurrent structure, is designed to learn from high-dimensional sequences and capture subtle non-linear dynamics or long-range dependencies that simpler models (like Logistic Regression or Random Forest) cannot. By providing the model with all 266 features, we are tasking it with acting as an end-to-end feature extractor, learning the optimal representation from the raw (though filtered) data.
Second, this high dimensionality (266 features vs. 780 training samples) presents a significant and well-documented risk of overfitting. To properly test the model under these conditions, we will replicate the same experiment we ran for our XGBoost and Random Forest models:
- LSTM (Standard): The model is trained on the full 266-feature set. We will rely heavily on strong regularization techniques, including L2 weight decay in the optimizer and Dropout in the network layers, to manage this high dimensionality.
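As a minimal illustration of the two regularization techniques named above (the layer sizes here are illustrative, not the exact architecture defined below), dropout is a layer inside the network, while L2 weight decay is an argument of the optimizer:

```python
import torch
import torch.nn as nn
import torch.optim as optim

torch.manual_seed(0)

# Dropout lives inside the network...
net = nn.Sequential(nn.Linear(266, 64), nn.ReLU(), nn.Dropout(p=0.3))
# ...while the L2 penalty is the optimizer's weight_decay argument.
optimizer = optim.Adam(net.parameters(), lr=1e-3, weight_decay=1e-4)

x = torch.randn(8, 266)
net.train()
out_train = net(x)  # dropout active: a fraction of activations is zeroed
net.eval()
out_eval = net(x)   # dropout disabled at evaluation time
```

Note that dropout only acts in `train()` mode; calling `eval()` before prediction, as we do in the rolling forecast below, disables it.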
initial_number_of_features = Xw.shape[1]
threshold_mi = 0.02
selected_features_w = set()
for col in Yw_binary.columns:
mi = mutual_info_classif(Xw, Yw_binary[col], random_state=SEED)
top_features_i = Xw.columns[mi > threshold_mi]
selected_features_w.update(top_features_i)
Xw_002 = Xw[list(selected_features_w)]
print(
f"{initial_number_of_features - len(selected_features_w)} features were removed for threshold {threshold_mi}."
)
print(f"So that {Xw_002.shape[1]} features are left for that threshold {threshold_mi}.")
print(
f"Features that are left, bar target related ones, are\
\n{[col for col in Xw_002.columns if not col.startswith(('Y', 'DGS'))]}."
)
158 features were removed for threshold 0.02. So that 382 features are left for that threshold 0.02. Features that are left, bar target related ones, are ['M2SL', 'WSHOSHO', 'NFCI', 'IRLTLT01CAM156N', 'GFDEGDQ188S', 'IRLTLT01GBM156N', 'CPIAUCSL', 'DNDGRG3M086SBEA', 'UNRATE', 'PERMIT', 'BOGMBASE', 'FYGFDPUN', 'PCEPI', 'CPITRNSL', 'FGDEF', 'CPIAPPSL', 'UMCSENT', 'NONREVSL', 'USGOVT', 'IRLTLT01FRM156N', 'IRLTLT01AUM156N', 'UEMP5TO14', 'M1SL', 'NASDAQCOM', 'CUUR0000SAD', 'MANEMP', 'RPI', 'IRLTLT01DEM156N', 'USFIRE', 'HOUST', 'CUSR0000SAC', 'DTCTHFNM', 'USCONS', 'USWTRADE', 'CURRCIR', 'INDPRO', 'IRLTLT01JPM156N', 'USTPU', 'CUUR0000SA0L2', 'FEDFUNDS', 'TOTRESNS', 'M2REAL', 'CPIULFSL', 'DCOILWTICO', 'NDMANEMP', 'USTRADE', 'DTCOLNVHFNM', 'SRVPRD', 'USGOOD', 'DMANEMP', 'DEXUSAL', 'MTSDS133FMS', 'GFDEBTN', 'PAYEMS', 'WSHOTSL', 'FGEXPND'].
# Helper Function: Create Sequences
def create_sequences(features, targets, lookback_weeks, forecast_horizon):
"""
Transforms time-series data into sequences for LSTM.
Args:
features (np.array): Standardized features.
targets (np.array): Binary targets (n_samples, n_maturities).
lookback_weeks (int): How many weeks of history to use (L).
forecast_horizon (int): How many weeks ahead to predict (4).
Returns:
X (np.array): Input sequences, shape (samples, lookback, n_features)
y (np.array): Target sequences, shape (samples, forecast_horizon, n_maturities)
"""
X, y = [], []
n_samples = len(features)
for i in range(n_samples):
# Find the end of this sequence
feature_end = i + lookback_weeks
target_end = feature_end + forecast_horizon
# Stop if we are at the end of the dataset
if target_end > n_samples:
break
# Get input (history) and output (future)
seq_x = features[i:feature_end, :]
seq_y = targets[feature_end:target_end, :]
X.append(seq_x)
y.append(seq_y)
return np.array(X), np.array(y)
# Define LSTM class parameters
lstm_hidden_1 = 128
lstm_hidden_2 = 64
dense_hidden = 128
dropout_rate = 0.3
class LSTMClassifier(nn.Module):
"""
PyTorch-based LSTM Classifier for multi-step, multi-output forecasting.
"""
def __init__(
self,
n_features,
n_maturities,
forecast_horizon=FORECAST_HORIZON,
lstm_hidden_1=lstm_hidden_1,
lstm_hidden_2=lstm_hidden_2,
dense_hidden=dense_hidden,
):
super(LSTMClassifier, self).__init__()
self.forecast_horizon = forecast_horizon
self.n_maturities = n_maturities
# Define Layers
self.lstm1 = nn.LSTM(
input_size=n_features,
hidden_size=lstm_hidden_1,
num_layers=1,
batch_first=True,
)
self.dropout1 = nn.Dropout(dropout_rate)
self.lstm2 = nn.LSTM(
input_size=lstm_hidden_1,
hidden_size=lstm_hidden_2,
num_layers=1,
batch_first=True,
)
self.dropout2 = nn.Dropout(dropout_rate)
self.dense1 = nn.Linear(lstm_hidden_2, dense_hidden)
self.relu = nn.ReLU()
self.output_layer = nn.Linear(dense_hidden, forecast_horizon * n_maturities)
self.sigmoid = nn.Sigmoid()
def forward(self, x):
"""Defines the forward pass of the model."""
# x shape: (batch_size, lookback_weeks, n_features)
# lstm1 output (output, (h_n, c_n))
lstm_out1, _ = self.lstm1(x)
lstm_out1 = self.dropout1(lstm_out1)
# lstm2 output
lstm_out2, _ = self.lstm2(lstm_out1)
# Get the last hidden state from the sequence
last_hidden_state = lstm_out2[:, -1, :]
last_hidden_state = self.dropout2(last_hidden_state)
# Pass through dense layers
dense_out = self.relu(self.dense1(last_hidden_state))
# Get final flat output
flat_output = self.output_layer(dense_out)
# Apply sigmoid activation
prob_output = self.sigmoid(flat_output)
# Reshape to (batch_size, forecast_horizon, M)
final_output = prob_output.view(-1, self.forecast_horizon, self.n_maturities)
return final_output
# Define Train Parameters
epochs = 50
batch_size = 32
patience = 10
learning_rate = 1e-3
def train_model_torch(
model,
X_train,
y_train,
X_val,
y_val,
device,
epochs=epochs,
batch_size=batch_size,
patience=patience,
lr=learning_rate,
):
"""
PyTorch training and validation loop with early stopping.
"""
# A. Create DataLoaders
train_dataset = TensorDataset(X_train, y_train)
train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
val_dataset = TensorDataset(X_val, y_val)
val_loader = DataLoader(val_dataset, batch_size=batch_size, shuffle=False)
# B. Define Loss and Optimizer
criterion = nn.BCELoss()
optimizer = optim.Adam(model.parameters(), lr=lr, weight_decay=1.0e-4)
# C. Early Stopping setup
best_val_loss = float("inf")
patience_counter = 0
best_model_state = None
history = {"train_loss": [], "val_loss": []}
for epoch in range(epochs):
# Training
model.train()
train_loss = 0.0
for inputs, targets in train_loader:
inputs, targets = inputs.to(device), targets.to(device)
optimizer.zero_grad()
outputs = model(inputs)
loss = criterion(outputs, targets)
loss.backward()
optimizer.step()
train_loss += loss.item() * inputs.size(0)
train_loss /= len(train_loader.dataset)
history["train_loss"].append(train_loss)
# Validation
model.eval()
val_loss = 0.0
with torch.no_grad():
for inputs, targets in val_loader:
inputs, targets = inputs.to(device), targets.to(device)
outputs = model(inputs)
loss = criterion(outputs, targets)
val_loss += loss.item() * inputs.size(0)
val_loss /= len(val_loader.dataset)
history["val_loss"].append(val_loss)
# Early Stopping Check
if val_loss < best_val_loss:
best_val_loss = val_loss
patience_counter = 0
# Save the best model state
best_model_state = copy.deepcopy(model.state_dict())
else:
patience_counter += 1
if patience_counter >= patience:
break
# Restore best model weights
if best_model_state:
model.load_state_dict(best_model_state)
return model, history # Return the best (early-stopped) trained model
# Define Rolling Forecast Parameters
ROLL_STEP = 4
LOOKBACK_WEEKS = 52
use_pca = False
pca_variance_threshold = 0.99
def run_rolling_forecast_torch(
all_features_df,
all_targets_df,
n_maturities,
lookback_weeks=LOOKBACK_WEEKS,
train_val_window_weeks=TRAIN_WINDOW_SIZE,
val_split_ratio=0.2,
forecast_horizon=FORECAST_HORIZON,
roll_step_weeks=ROLL_STEP,
use_pca=use_pca,
pca_variance_threshold=pca_variance_threshold,
):
"""
Conducts the full rolling-window forecast using PyTorch.
"""
model_type = "LSTM with PCA" if use_pca else "LSTM (No PCA)"
print(f"Starting PyTorch rolling forecast for: {model_type}...")
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
all_predictions = []
last_fold_history = None
val_weeks = int(train_val_window_weeks * val_split_ratio)
train_weeks = train_val_window_weeks - val_weeks
start_pred_idx = train_val_window_weeks
for t in tqdm(
range(
start_pred_idx, len(all_features_df) - forecast_horizon + 1, roll_step_weeks
)
):
# A. Define Window Indices
pred_input_start = t - lookback_weeks
pred_input_end = t
val_end = pred_input_end
train_start = val_end - train_val_window_weeks
if train_start < 0:
continue
# B. Get Data Slices
train_val_features_df = all_features_df.iloc[train_start:val_end]
test_lookback_features_df = all_features_df.iloc[
pred_input_start:pred_input_end
]
train_val_targets_df = all_targets_df.iloc[
train_start : val_end + forecast_horizon
]
# C. Standardize Features (and optionally apply PCA)
scaler = StandardScaler()
train_only_features = train_val_features_df.iloc[:train_weeks]
# Fit ONLY on training data
scaler.fit(train_only_features.values)
n_features_for_model = all_features_df.shape[1] # Default
if use_pca:
pca = PCA(n_components=pca_variance_threshold)
pca.fit(train_only_features)
n_features_for_model = pca.n_components_
# Transform all feature blocks
train_val_features_transformed = pca.transform(
scaler.transform(train_val_features_df.values)
)
test_lookback_features_transformed = pca.transform(
scaler.transform(test_lookback_features_df.values)
)
else:
# Just use the scaled features
train_val_features_transformed = scaler.transform(
train_val_features_df.values
)
test_lookback_features_transformed = scaler.transform(
test_lookback_features_df.values
)
# D. Create Sequences
X_train_val_seq, y_train_val_seq = create_sequences(
train_val_features_transformed,
train_val_targets_df.values,
lookback_weeks,
forecast_horizon,
)
if len(X_train_val_seq) == 0:
continue
# E. Split Sequences
split_idx = train_weeks - lookback_weeks + 1
if split_idx <= 0 or split_idx >= len(X_train_val_seq):
continue
X_train, y_train = X_train_val_seq[:split_idx], y_train_val_seq[:split_idx]
X_val, y_val = X_train_val_seq[split_idx:], y_train_val_seq[split_idx:]
if len(X_train) == 0 or len(X_val) == 0:
continue
# F. Convert to PyTorch Tensors
X_train_tensor = torch.from_numpy(X_train).float().to(device)
y_train_tensor = torch.from_numpy(y_train).float().to(device)
X_val_tensor = torch.from_numpy(X_val).float().to(device)
y_val_tensor = torch.from_numpy(y_val).float().to(device)
# G. Build, Train, and Predict
model = LSTMClassifier(
n_features=n_features_for_model,
n_maturities=n_maturities,
forecast_horizon=forecast_horizon,
).to(device)
model, history = train_model_torch(
model,
X_train_tensor,
y_train_tensor,
X_val_tensor,
y_val_tensor,
device,
)
last_fold_history = history
# H. Make Out-of-Sample Prediction
X_pred_tensor = (
torch.from_numpy(
test_lookback_features_transformed.reshape(1, lookback_weeks, -1)
)
.float()
.to(device)
)
model.eval()
with torch.no_grad():
y_pred_probs_tensor = model(X_pred_tensor)
y_pred_probs = y_pred_probs_tensor.cpu().numpy()[0]
# I. Store Prediction
prediction_date = all_features_df.index[pred_input_end - 1]
target_dates = all_targets_df.index[t : t + forecast_horizon]
for week_ahead in range(forecast_horizon):
for mat_idx, mat_name in enumerate(all_targets_df.columns):
pred_prob = y_pred_probs[week_ahead, mat_idx]
target_date = target_dates[week_ahead]
all_predictions.append(
{
"Prediction_Date": prediction_date,
"Target_Date": target_date,
"Forecast_Task": f"{mat_name}_t+{week_ahead + 1}",
"Prob_Up": pred_prob,
}
)
print("PyTorch rolling forecast complete.")
final_predictions_df = pd.DataFrame(all_predictions)
final_predictions_df = final_predictions_df.drop_duplicates(
subset=["Target_Date", "Forecast_Task"], keep="first"
)
return final_predictions_df.set_index("Target_Date"), last_fold_history
def evaluate_aggregated_accuracy(oos_predictions, all_targets_df, threshold=0.5):
"""
Evaluates predictions by *aggregating* the 4-week forecast horizon.
It returns one accuracy score per model (Prediction_Date) per Maturity.
Args:
oos_predictions (pd.DataFrame): The 'long' DataFrame from run_rolling_forecast_torch.
all_targets_df (pd.DataFrame): Original DataFrame with true binary directions.
threshold (float): The probability threshold to classify '1' (Up).
Returns:
pd.DataFrame: A wide DataFrame where:
Index = Prediction_Date (the model ID)
Columns = Maturities (e.g., '3Y', '5Y')
Values = Aggregated accuracy over the 4-week horizon
"""
# A. Prepare True Values
true_targets_long = all_targets_df.stack()
true_targets_long = true_targets_long.reset_index()
true_targets_long.rename(
columns={"level_0": "Target_Date", "level_1": "Maturity"}, inplace=True
)
# B. Prepare Predictions
preds_df = oos_predictions.reset_index().copy()
# Extract the base maturity name (e.g., "Y_10year_t+1" -> "Y_10year")
preds_df["Maturity"] = preds_df["Forecast_Task"].apply(lambda x: x[:-4])
# C. Merge True and Predicted Values
eval_df = pd.merge(
preds_df,
true_targets_long,
on=["Target_Date", "Maturity"],
how="inner",
).rename(columns={0: "True_Value"})
if eval_df.empty:
print("Error: Merge failed. No matching (Target_Date, Maturity) pairs found.")
return pd.DataFrame()
# Create binary predictions from probabilities
eval_df["Pred_Value"] = (eval_df["Prob_Up"] >= threshold).astype(int)
# D. Group by Model and Maturity
# We group by the 'Prediction_Date' (which identifies the model)
# and the 'Maturity' (which identifies the target yield),
# selecting only the value columns so pandas does not apply over the
# grouping columns (avoids a FutureWarning in recent pandas versions).
grouped = eval_df.groupby(["Prediction_Date", "Maturity"])[
["True_Value", "Pred_Value"]
]
# E. Calculate Aggregated Accuracy
def calculate_accuracy(group):
# Calculate one accuracy score for the whole 4-week group
return accuracy_score(group["True_Value"], group["Pred_Value"])
accuracy_series = grouped.apply(calculate_accuracy)
accuracy_series.name = "Aggregated_Accuracy"
# F. Pivot to Desired Format
final_accuracy_df = accuracy_series.unstack(level="Maturity")
final_accuracy_df.index = range(len(final_accuracy_df))
return final_accuracy_df
4.2.4.2. Running LSTM¶
X_lstm = Xw_002
total_weeks = X_lstm.shape[0]
n_features = X_lstm.shape[1]
maturities = Yw_binary.columns
n_maturities = len(maturities)
oos_predictions_torch, last_history = run_rolling_forecast_torch(
X_lstm,
Yw_binary,
n_maturities=n_maturities,
lookback_weeks=LOOKBACK_WEEKS,
train_val_window_weeks=TRAIN_WINDOW_SIZE,
val_split_ratio=0.2,
forecast_horizon=FORECAST_HORIZON,
roll_step_weeks=ROLL_STEP,
)
Starting PyTorch rolling forecast for: LSTM (No PCA)...
100%|██████████| 79/79 [1:52:30<00:00, 85.45s/it]
PyTorch rolling forecast complete.
oos_predictions_torch.to_csv("model_results/forecast_LSTM_MI_0.02.csv")
try:
aggregated_accuracy_report = evaluate_aggregated_accuracy(
oos_predictions_torch, Yw_binary, threshold=0.5
)
print("\n--- Aggregated Accuracy per Model and Maturity ---")
print(aggregated_accuracy_report.head())
except NameError:
print("Run the rolling forecast first to generate 'oos_predictions_torch'.")
--- Aggregated Accuracy per Model and Maturity ---
Maturity  Y_DGS1  Y_DGS10  Y_DGS1MO  Y_DGS2  Y_DGS20  Y_DGS3  Y_DGS30  \
0           0.75     0.75      0.25    0.75     0.75    0.75     0.50
1           0.75     0.75      0.50    0.75     0.75    0.75     0.75
2           0.75     0.50      0.50    0.50     0.25    0.50     0.25
3           0.75     0.75      0.50    0.50     0.75    0.75     0.75
4           0.75     0.25      0.75    0.75     1.00    0.25     0.75

Maturity  Y_DGS3MO  Y_DGS5  Y_DGS6MO  Y_DGS7
0             0.25    1.00      0.75    1.00
1             0.50    0.75      0.50    0.50
2             0.50    0.50      0.25    0.75
3             0.50    0.75      0.75    0.75
4             0.75    0.50      0.50    0.75
aggregated_accuracy_report.to_csv("model_results/accuracies_LSTM_MI_0.02.csv")
# --- 2. Plot the loss curves ---
if last_history:
print("\nPlotting loss curves from the last training fold...")
train_loss = last_history["train_loss"]
val_loss = last_history["val_loss"]
plt.figure(figsize=(10, 6))
plt.plot(train_loss, label="Training Loss", color="blue")
plt.plot(val_loss, label="Validation Loss", color="orange", linestyle="--")
# Find the epoch with the best validation loss
best_epoch = np.argmin(val_loss)
plt.axvline(
best_epoch,
color="red",
linestyle=":",
label=f"Best Model (Epoch {best_epoch + 1})",
)
plt.title("Training vs. Validation Loss (Last Fold)")
plt.xlabel("Epoch")
plt.ylabel("Binary Cross-Entropy Loss")
plt.legend()
plt.grid(True, linestyle="--", alpha=0.6)
plt.show()
else:
print("No history was captured. Did the training loop run?")
Plotting loss curves from the last training fold...
Analysis of LSTM Training vs. Validation Loss¶
This plot clearly illustrates that the model begins to overfit almost immediately. The regularization (Dropout and L2 weight decay) is insufficient to control the model's high complexity.
- The Best Model (Epoch 1): The model achieves its lowest validation loss at the very beginning of training (at index 0, labeled "Epoch 1"). This is the single point where the model found a generalizable pattern. The early-stopping mechanism (red dotted line) correctly identifies this as the best model.
- The Divergence (Epoch 1 onward): Immediately after this point, the two loss curves diverge dramatically:
- Training Loss (Blue): Steadily decreases from ~0.69 to ~0.62. This shows the model is successfully "memorizing" the specific noise and patterns of the training data.
- Validation Loss (Orange): Steadily increases from ~0.70 to ~0.77. This is the critical sign of overfitting. It shows that as the model gets "better" at the training data, it is simultaneously getting worse at predicting unseen data from the validation set.
5. Binary Classification Models Analysis¶
5.1. McNemar tests¶
Model Significance: McNemar's Test vs. Naive Benchmark¶
To validate that our models are learning a genuine predictive signal, we must test their performance against a non-trivial baseline. Accuracy scores alone can be misleading, especially in datasets with class imbalance (where one direction, "Up" or "Down", is far more common).
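For instance (with made-up proportions, purely for illustration), a skill-free classifier that always predicts the majority class can still post a respectable accuracy:

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Hypothetical imbalanced sample: "Up" 60% of the time
y_true = np.array([1] * 60 + [0] * 40)
y_naive = np.ones_like(y_true)  # always predict "Up"

print(accuracy_score(y_true, y_naive))  # 0.6, with zero predictive skill
```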
To this end, we will employ McNemar's test, a non-parametric test designed to compare the classification errors of two paired models.
The Benchmark: Majority Class Classifier¶
Our benchmark is a naive "majority class" classifier. This model is not static; it is re-trained on each of the 79 rolling-window training sets. For each window, it identifies the most frequent class (e.g., "Up") and then predicts only that class for the subsequent 4-week test period.
Test Rationale¶
Our hypothesis is that our trained models (Logistic, XGBoost, LSTM) are statistically superior to this naive benchmark. The McNemar test will compare the paired forecast errors of our model and the majority class classifier.
If the test rejects the null hypothesis (that the two models have the same disagreement rate), it provides statistical evidence that our models are effectively learning from the features and are not just defaulting to the most common outcome.
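As a minimal sketch of the mechanics (on synthetic data, not our actual forecasts): McNemar's test compares *where* the two classifiers err, not just how often, by cross-tabulating the paired correctness indicators into a 2x2 table:

```python
import numpy as np
from sklearn.metrics import confusion_matrix
from statsmodels.stats.contingency_tables import mcnemar

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, 200)                               # synthetic true directions
y_model = np.where(rng.random(200) < 0.7, y_true, 1 - y_true)  # ~70%-accurate model
y_bench = np.ones(200, dtype=int)                              # naive "always Up" benchmark

# 2x2 table: (model correct?) vs (benchmark correct?)
correct_model = y_model == y_true
correct_bench = y_bench == y_true
table = confusion_matrix(correct_model, correct_bench)

# Chi-square version with continuity correction, as used below
result = mcnemar(table, exact=False, correction=True)
print(f"McNemar p-value: {result.pvalue:.4f}")
```

The off-diagonal cells (one classifier right, the other wrong) drive the statistic; a small p-value indicates the two classifiers have genuinely different error patterns.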
We build the majority class classifier below:
majority_class_classifier = {col: [] for col in Yw_binary.columns}
for start in range(
0, len(Xw) - TRAIN_WINDOW_SIZE - FORECAST_HORIZON + 1, FORECAST_HORIZON
):
end_train = start + TRAIN_WINDOW_SIZE
end_pred = end_train + FORECAST_HORIZON
for col in Yw_binary.columns:
Yw_binary_train = Yw_binary.iloc[start:end_train][col]
if Yw_binary_train.mean() < 0.5:
majority_class_classifier[col].extend([0, 0, 0, 0])
else:
majority_class_classifier[col].extend([1, 1, 1, 1])
for col in Yw_binary.columns:
majority_class_classifier[col].pop(0)
majority_class_classifier = pd.DataFrame(
majority_class_classifier, index=Yw_binary.iloc[780 : Yw_binary.shape[0] - 1].index
)
a = (
(majority_class_classifier == Yw_binary.iloc[780 : Yw_binary.shape[0] - 1])
.astype(int)
.mean()
)
a
Y_DGS1MO 0.590476 Y_DGS3MO 0.555556 Y_DGS6MO 0.523810 Y_DGS1 0.555556 Y_DGS2 0.526984 Y_DGS3 0.507937 Y_DGS5 0.507937 Y_DGS7 0.504762 Y_DGS10 0.507937 Y_DGS20 0.523810 Y_DGS30 0.511111 dtype: float64
We can now run the McNemar test:
forecast = {
"lr 0.03": pd.read_csv(
"results of models/forecast logistic regression, 15y train test.csv",
index_col=0,
),
"lr 0.03 pca": pd.read_csv(
"results of models/forecast logistic regression pca, 15y train test.csv",
index_col=0,
),
"lr 0.035": pd.read_csv(
"results of models/forecast logistic regression, MI 0.035, 15y train test.csv",
index_col=0,
),
"lr 0.04": pd.read_csv(
"results of models/forecast logistic regression, MI 0.04, 15y train test.csv",
index_col=0,
),
"lr 0.05": pd.read_csv(
"results of models/forecast logistic regression, MI 0.05, 15y train test.csv",
index_col=0,
),
"rf 0.03": pd.read_csv(
"results of models/forecast random forest, no pca, 15y train test.csv",
index_col=0,
),
"rf 0.03 pca": pd.read_csv(
"results of models/forecast random forest pca, 15y train test.csv", index_col=0
),
"rf 0.035": pd.read_csv(
"results of models/forecast random forest, MI 0.035, 15y train test.csv",
index_col=0,
),
"rf 0.04": pd.read_csv(
"results of models/forecast random forest, MI 0.04, 15y train test.csv",
index_col=0,
),
"rf 0.05": pd.read_csv(
"results of models/forecast random forest, MI 0.05, 15y train test.csv",
index_col=0,
),
"xg 0.03": pd.read_csv(
"results of models/forecast xgboost, 15y train test.csv", index_col=0
),
"xg 0.035": pd.read_csv(
"results of models/forecast xgboost, MI 0.035, 15y train test.csv", index_col=0
),
"xg 0.04": pd.read_csv(
"results of models/forecast xgboost, MI 0.04, 15y train test.csv", index_col=0
),
"xg 0.05": pd.read_csv(
"results of models/forecast xgboost, MI 0.05, 15y train test.csv", index_col=0
),
"LSTM 0.02": (
pd.read_csv(
"results of models/forecast_LSTM_cls_12864128_0.02MI_2patience_52LB_evalsscores.csv",
index_col=0,
)
> 0.5
).astype(int),
}
pvalues = pd.DataFrame(index=forecast.keys(), columns=majority_class_classifier.columns)
for key in forecast.keys():
for col in majority_class_classifier.columns:
y_pred = forecast[key][col][1:]
y_benchmark = majority_class_classifier[col]
y_true = Yw_binary.iloc[780 : Yw_binary.shape[0] - 1][col]
# McNemar test
correct_real = y_pred == y_true
correct_benchmark = y_benchmark == y_true
table = confusion_matrix(correct_real, correct_benchmark)
result = mcnemar(table, exact=False, correction=True)
pvalues.loc[key, col] = result.pvalue
Below are the p-values of the McNemar test for each of the trained models, for each yield in the yield curve. The null hypothesis $H_0$ of this test is: "The model has no better error rate than the majority class classifier".
Each column name of this dataframe has three parts:
- lr, rf, xg, or lstm, depending on whether the model is a logistic regression, random forest, XGBoost, or LSTM;
- 0.03, 0.035, 0.04, or 0.05, depending on the dataset on which the model was trained. As a reminder, this number is the mutual information threshold: for the dataset labelled 0.03, we kept only the features whose mutual information with the targets is above 0.03;
- pca or nothing, depending on whether we applied PCA to reduce dimensionality before training the model.
For instance, the first cell of the table corresponds to the logistic regression model trained on the features with mutual information above 0.03, evaluated on the 1-month yield.
pvalues = pvalues.T
pvalues = pvalues.add_prefix("p-value of McNemar test for ")
pvalues.index = [
"1m yield",
"3m yield",
"6m yield",
"1y yield",
"2y yield",
"3y yield",
"5y yield",
"7y yield",
"10y yield",
"20y yield",
"30y yield",
]
pvalues
| p-value of McNemar test for lr 0.03 | p-value of McNemar test for lr 0.03 pca | p-value of McNemar test for lr 0.035 | p-value of McNemar test for lr 0.04 | p-value of McNemar test for lr 0.05 | p-value of McNemar test for rf 0.03 | p-value of McNemar test for rf 0.03 pca | p-value of McNemar test for rf 0.035 | p-value of McNemar test for rf 0.04 | p-value of McNemar test for rf 0.05 | p-value of McNemar test for xg 0.03 | p-value of McNemar test for xg 0.035 | p-value of McNemar test for xg 0.04 | p-value of McNemar test for xg 0.05 | p-value of McNemar test for LSTM 0.02 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1m yield | 0.790882 | 0.720515 | 0.403364 | 0.219303 | 0.374887 | 0.093533 | 0.417304 | 0.156069 | 0.101707 | 0.053631 | 0.07305 | 0.321062 | 0.442356 | 0.271818 | 0.001604 |
| 3m yield | 0.000821 | 0.001401 | 0.000009 | 0.000011 | 0.015663 | 0.000025 | 0.00922 | 0.000025 | 0.00001 | 0.000093 | 0.000159 | 0.000058 | 0.000069 | 0.000285 | 0.055405 |
| 6m yield | 0.002985 | 0.004841 | 0.000043 | 0.001451 | 0.005075 | 0.000079 | 0.21516 | 0.000121 | 0.000045 | 0.000051 | 0.000128 | 0.000017 | 0.000025 | 0.000386 | 0.424278 |
| 1y yield | 0.749549 | 0.739752 | 0.080567 | 0.130646 | 0.412278 | 0.309941 | 0.331975 | 0.353263 | 0.232461 | 0.133614 | 1.0 | 0.135593 | 0.054292 | 0.844519 | 0.055405 |
| 2y yield | 0.43437 | 0.21746 | 0.223811 | 0.774537 | 0.613431 | 0.386476 | 0.606905 | 0.34254 | 1.0 | 0.538167 | 1.0 | 0.748774 | 0.877371 | 0.863832 | 0.367324 |
| 3y yield | 0.377638 | 0.364346 | 0.540686 | 0.417405 | 0.560624 | 0.515735 | 0.589639 | 0.57246 | 0.302895 | 0.135697 | 0.719438 | 0.163734 | 0.498962 | 0.689157 | 0.821688 |
| 5y yield | 1.0 | 0.260798 | 0.347262 | 0.452264 | 0.882466 | 0.307101 | 0.089887 | 0.930111 | 0.855725 | 0.850764 | 0.470486 | 0.748774 | 0.770493 | 0.441418 | 0.821688 |
| 7y yield | 0.272832 | 0.525793 | 0.487453 | 0.65738 | 0.943628 | 0.34254 | 0.200994 | 0.846687 | 0.658531 | 0.187836 | 0.626496 | 1.0 | 0.877371 | 1.0 | 0.910279 |
| 10y yield | 0.47399 | 0.751076 | 0.636309 | 0.605822 | 0.607766 | 1.0 | 0.740422 | 0.594032 | 0.921127 | 0.470298 | 0.823063 | 0.453255 | 0.037635 | 0.605577 | 0.821688 |
| 20y yield | 0.632044 | 0.810181 | 0.818022 | 0.583588 | 0.059948 | 0.251452 | 0.830218 | 0.152661 | 0.395092 | 0.778729 | 0.422678 | 0.75183 | 0.19043 | 0.371093 | 0.430223 |
| 30y yield | 0.767468 | 1.0 | 0.558536 | 0.697936 | 0.263786 | 0.377494 | 0.533829 | 0.223391 | 0.597312 | 0.672604 | 0.4795 | 0.813664 | 0.382733 | 0.546494 | 0.735317 |
Remark: we had to remove the XGBoost model with PCA because it predicted only zeros, which caused errors when running the test.
The models whose performance is significantly better than the constant benchmark classifier (p-value < 0.1) are listed below. A cell contains 1 if the model's forecasting power is statistically significant and 0 otherwise:
pval = (pvalues < 0.1).astype(int)
pval.loc["Number of significant models"] = pval.sum(axis=0)
pval.columns = forecast.keys()
pval
| lr 0.03 | lr 0.03 pca | lr 0.035 | lr 0.04 | lr 0.05 | rf 0.03 | rf 0.03 pca | rf 0.035 | rf 0.04 | rf 0.05 | xg 0.03 | xg 0.035 | xg 0.04 | xg 0.05 | LSTM 0.02 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1m yield | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 1 |
| 3m yield | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| 6m yield | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 |
| 1y yield | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 |
| 2y yield | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3y yield | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 5y yield | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 7y yield | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 10y yield | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 20y yield | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 30y yield | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Number of significant models | 2 | 2 | 3 | 2 | 3 | 3 | 2 | 2 | 2 | 3 | 3 | 2 | 4 | 2 | 3 |
Statistical significance varies widely across models and across parts of the yield curve:
- Most models have statistically significant forecasting power on the short end of the yield curve, particularly for the 3-month and 6-month yields.
- No model achieves statistically significant forecasting power on the 2y, 3y, 7y and 30y yields. This is not surprising: long-term yields are less driven by macroeconomic, financial and monetary variables than short-term yields. They mostly depend on expectations of future short rates over many years, which are latent and hard to observe and model, so long-term yields are more difficult to forecast.
Generally, the models that most often produce classifiers with statistically significant forecasting power are XGBoost and Random Forest; on this dataset, logistic regression produces slightly fewer significantly performing classifiers. This illustrates an advantage of tree-based models: they can capture nonlinearities in the data and thus outperform logistic regression.
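As a quick illustration of this advantage (on toy synthetic data, not our yield dataset), a target driven by an interaction of two features defeats a linear classifier but not a tree ensemble:

```python
# Toy illustration (synthetic data, unrelated to the yield dataset): an
# XOR-style target is not linearly separable, so logistic regression stays
# near chance level while a random forest recovers the interaction.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 2))
y = ((X[:, 0] > 0) ^ (X[:, 1] > 0)).astype(int)  # XOR of the two feature signs

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
acc_lr = LogisticRegression().fit(X_tr, y_tr).score(X_te, y_te)
acc_rf = RandomForestClassifier(random_state=0).fit(X_tr, y_tr).score(X_te, y_te)
print(f"logistic regression: {acc_lr:.2f}, random forest: {acc_rf:.2f}")
```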
5.2. Analysis of the out-of-sample accuracy of the models
Now, we can observe and analyze the average out-of-sample accuracy of all the models trained over the 79 fifteen-year rolling windows:
# Map each model label to its results file; load each file and average the
# accuracy over the rolling windows.
accuracy_files = {
    "lr 0.03": "accuracies logistic regression, 15y train test.csv",
    "lr 0.03 pca": "accuracies logistic regression pca, 15y train test.csv",
    "lr 0.035": "accuracies logistic regression, MI 0.035, 15y train test.csv",
    "lr 0.04": "accuracies logistic regression, MI 0.04, 15y train test.csv",
    "lr 0.05": "accuracies logistic regression, MI 0.05, 15y train test.csv",
    "rf 0.03": "accuracies random forest, no pca, 15y train test.csv",
    "rf 0.03 pca": "accuracies random forest pca, 15y train test.csv",
    "rf 0.035": "accuracies random forest, MI 0.035, 15y train test.csv",
    "rf 0.04": "accuracies random forest, MI 0.04, 15y train test.csv",
    "rf 0.05": "accuracies random forest, MI 0.05, 15y train test.csv",
    "xg 0.03": "accuracies xgboost, 15y train test.csv",
    "xg 0.03 pca": "accuracies xgboost, pca, 15y train test.csv",
    "xg 0.035": "accuracies xgboost, MI 0.035, 15y train test.csv",
    "xg 0.04": "accuracies xgboost, MI 0.04, 15y train test.csv",
    "xg 0.05": "accuracies xgboost, MI 0.05, 15y train test.csv",
}
accuracies = pd.DataFrame(
    {
        name: pd.read_csv(f"results of models/{fname}", index_col=0).mean()
        for name, fname in accuracy_files.items()
    }
)
accuracies_lstm = pd.read_csv(
"results of models/accuracies_lstm_cls.csv", index_col=0
).mean()
accuracies_lstm.name = "LSTM 0.02"
accuracies = pd.concat([accuracies, accuracies_lstm], axis=1).round(2)
The dataframe below shows the average out-of-sample accuracy of each model over the 79 fifteen-year rolling windows; each column corresponds to one model.
accuracies.index = [
"1m yield",
"3m yield",
"6m yield",
"1y yield",
"2y yield",
"3y yield",
"5y yield",
"7y yield",
"10y yield",
"20y yield",
"30y yield",
]
accuracies = accuracies.add_prefix("average out-of-sample accuracy for ")
# `a` holds the per-yield accuracy of the majority class classifier
# (i.e. the proportion of the majority class), computed earlier
accuracies[
    "average out-of-sample accuracy for the majority class classifier benchmark"
] = a.values.round(2)
accuracies
| average out-of-sample accuracy for lr 0.03 | average out-of-sample accuracy for lr 0.03 pca | average out-of-sample accuracy for lr 0.035 | average out-of-sample accuracy for lr 0.04 | average out-of-sample accuracy for lr 0.05 | average out-of-sample accuracy for rf 0.03 | average out-of-sample accuracy for rf 0.03 pca | average out-of-sample accuracy for rf 0.035 | average out-of-sample accuracy for rf 0.04 | average out-of-sample accuracy for rf 0.05 | average out-of-sample accuracy for xg 0.03 | average out-of-sample accuracy for xg 0.03 pca | average out-of-sample accuracy for xg 0.035 | average out-of-sample accuracy for xg 0.04 | average out-of-sample accuracy for xg 0.05 | average out-of-sample accuracy for LSTM 0.02 | average out-of-sample accuracy for the majority class classifier benchmark | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1m yield | 0.58 | 0.61 | 0.62 | 0.64 | 0.63 | 0.64 | 0.61 | 0.64 | 0.64 | 0.65 | 0.64 | 0.59 | 0.62 | 0.61 | 0.62 | 0.63 | 0.59 |
| 3m yield | 0.68 | 0.67 | 0.71 | 0.71 | 0.65 | 0.70 | 0.62 | 0.70 | 0.70 | 0.69 | 0.68 | 0.55 | 0.69 | 0.69 | 0.68 | 0.62 | 0.56 |
| 6m yield | 0.64 | 0.63 | 0.68 | 0.65 | 0.64 | 0.67 | 0.57 | 0.67 | 0.67 | 0.67 | 0.66 | 0.50 | 0.68 | 0.67 | 0.66 | 0.63 | 0.52 |
| 1y yield | 0.57 | 0.57 | 0.62 | 0.62 | 0.59 | 0.59 | 0.54 | 0.59 | 0.59 | 0.60 | 0.56 | 0.56 | 0.53 | 0.52 | 0.56 | 0.57 | 0.56 |
| 2y yield | 0.56 | 0.59 | 0.58 | 0.54 | 0.55 | 0.56 | 0.54 | 0.56 | 0.53 | 0.55 | 0.53 | 0.53 | 0.52 | 0.52 | 0.53 | 0.57 | 0.53 |
| 3y yield | 0.55 | 0.55 | 0.53 | 0.54 | 0.53 | 0.53 | 0.53 | 0.53 | 0.55 | 0.56 | 0.50 | 0.51 | 0.48 | 0.49 | 0.50 | 0.50 | 0.51 |
| 5y yield | 0.50 | 0.55 | 0.55 | 0.54 | 0.50 | 0.54 | 0.56 | 0.51 | 0.52 | 0.52 | 0.53 | 0.51 | 0.50 | 0.50 | 0.49 | 0.49 | 0.51 |
| 7y yield | 0.55 | 0.53 | 0.53 | 0.53 | 0.51 | 0.54 | 0.54 | 0.51 | 0.52 | 0.55 | 0.49 | 0.50 | 0.51 | 0.50 | 0.51 | 0.47 | 0.50 |
| 10y yield | 0.54 | 0.52 | 0.48 | 0.48 | 0.48 | 0.51 | 0.52 | 0.49 | 0.50 | 0.53 | 0.51 | 0.51 | 0.49 | 0.47 | 0.50 | 0.48 | 0.51 |
| 20y yield | 0.50 | 0.51 | 0.51 | 0.49 | 0.43 | 0.49 | 0.53 | 0.47 | 0.49 | 0.53 | 0.51 | 0.52 | 0.52 | 0.50 | 0.51 | 0.52 | 0.52 |
| 30y yield | 0.53 | 0.51 | 0.48 | 0.49 | 0.46 | 0.48 | 0.53 | 0.47 | 0.49 | 0.49 | 0.53 | 0.51 | 0.52 | 0.50 | 0.50 | 0.52 | 0.51 |
For each maturity of the yield curve, we identify the model with the best average out-of-sample accuracy over the 79 fifteen-year rolling training periods:
prefix = "average out-of-sample accuracy for "
best_cols = [col.removeprefix(prefix) for col in accuracies.idxmax(axis=1)]
best_models = pd.DataFrame(
    {
        "Best model": best_cols,
        "Best average out-of-sample accuracy": accuracies.max(axis=1),
        "statistical significance of the forecasting power": [
            pval.loc[ind, x]
            for ind, x in zip(accuracies.idxmax(axis=1).index, best_cols)
        ],
    }
)
best_models
| Best model | Best average out-of-sample accuracy | statistical significance of the forecasting power | |
|---|---|---|---|
| 1m yield | rf 0.05 | 0.65 | 1 |
| 3m yield | lr 0.035 | 0.71 | 1 |
| 6m yield | lr 0.035 | 0.68 | 1 |
| 1y yield | lr 0.035 | 0.62 | 1 |
| 2y yield | lr 0.03 pca | 0.59 | 0 |
| 3y yield | rf 0.05 | 0.56 | 0 |
| 5y yield | rf 0.03 pca | 0.56 | 1 |
| 7y yield | lr 0.03 | 0.55 | 0 |
| 10y yield | lr 0.03 | 0.54 | 0 |
| 20y yield | rf 0.03 pca | 0.53 | 0 |
| 30y yield | lr 0.03 | 0.53 | 0 |
We can also list, for each point of the yield curve where one exists, the best statistically significant model:
pvalc = pval.drop(labels=["Number of significative models"], axis=0)
# Drop the XGBoost-with-PCA column (excluded from the significance test, see
# the remark above) and the benchmark column, so that the remaining columns
# line up one-to-one with those of pvalc
ac = accuracies.drop(
    labels=[
        "average out-of-sample accuracy for xg 0.03 pca",
        "average out-of-sample accuracy for the majority class classifier benchmark",
    ],
    axis=1,
)
ac.columns = pvalc.columns
acc_sig = ac.where(pvalc == 1)
best_models = pd.DataFrame(
{"best significant model": acc_sig.idxmax(axis=1), "accuracy": acc_sig.max(axis=1)}
)
best_models.dropna()
/var/folders/l4/rqys14ws2955nprxbdsj722r0000gn/T/ipykernel_17680/3088785327.py:14: FutureWarning: The behavior of DataFrame.idxmax with all-NA values, or any-NA and skipna=False, is deprecated. In a future version this will raise ValueError
{"best significant model": acc_sig.idxmax(axis=1), "accuracy": acc_sig.max(axis=1)}
| best significant model | accuracy | |
|---|---|---|
| 1m yield | rf 0.03 | 0.64 |
| 3m yield | lr 0.035 | 0.71 |
| 6m yield | lr 0.035 | 0.68 |
| 1y yield | lr 0.035 | 0.62 |
| 5y yield | rf 0.03 pca | 0.51 |
| 10y yield | xg 0.04 | 0.47 |
| 20y yield | lr 0.05 | 0.43 |
Some important observations on the models, the features used and the overall accuracy:
We can see that in most cases, logistic regression achieves the highest average out-of-sample accuracy: 7 out of 11 yield maturities are best forecast with logistic regression, but only 3 of those models are statistically significant.
For the 4 remaining yield maturities, the model with the highest average out-of-sample accuracy is the random forest.
Overall, our models achieve a satisfying out-of-sample accuracy on the short end of the yield curve, i.e. on yields of bonds maturing within 2 years: for maturities of one year or less, the average out-of-sample accuracies are above 60% and reach up to 71%. These figures are encouraging, and these models could provide useful signals for trading strategies. However, we could not backtest such a strategy because we could not access data on US bond prices.
It is more difficult to forecast the middle and long end of the yield curve (maturities above 1 year): there, the average out-of-sample accuracy lies between 0.53 and 0.59. This is less satisfying, and our models provide a weaker signal on those yields. The lower predictive power on the long end is consistent with the lack of statistical significance of the models' forecasting power on these yields.
Interestingly, the random forest sometimes outperforms logistic regression on the long end of the yield curve. This could suggest a nonlinear relationship between some features and long-term yields that a linear model like logistic regression does not capture well.
Even when random forests provide a higher out-of-sample accuracy than logistic regression, the improvement is only marginal: for instance, on the 5y yield, logistic regression reaches 55% out-of-sample accuracy versus 56% for the random forest. Such a gain may not justify the added complexity and the loss of interpretability.
For most yields, the datasets restricted to features with mutual information above 0.03 or 0.035 produce the best trained models. These datasets contain a set of macroeconomic and financial variables as well as functions of the past yields (lags, rolling means, quantiles, ...).
However, for the shortest maturity, i.e. the 1-month yield, the best out-of-sample accuracy comes from the dataset of features with mutual information above 0.05, which contains only functions of the past yield. It is interesting that we can reach a 65% out-of-sample accuracy in predicting the 1-month yield using only its own past values. This could suggest that the trend-following and mean-reverting behavior of the 1-month yield is stronger than for other yields, and that it could suit trading strategies exploiting this behavior.
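One way to probe this claim would be the lag-1 autocorrelation of the weekly yield changes (positive suggests trend-following, negative suggests mean reversion). The sketch below applies this diagnostic to a synthetic AR(1) series that merely stands in for the 1-month yield changes; `phi` is an illustrative persistence parameter, not an estimate from our data.

```python
# Hedged sketch: momentum/mean-reversion diagnostic via lag-1 autocorrelation.
# The AR(1) series below is synthetic, standing in for weekly yield changes.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
phi = 0.3  # assumed persistence of weekly changes (illustrative only)
n = 500
changes = np.zeros(n)
eps = rng.normal(size=n)
for t in range(1, n):
    changes[t] = phi * changes[t - 1] + eps[t]

s = pd.Series(changes)
rho1 = s.autocorr(lag=1)  # > 0: trend-following; < 0: mean-reverting
print(f"lag-1 autocorrelation of the changes: {rho1:.2f}")
```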
It is also interesting to look at the values forecast by our models: the average proportion of ones each model predicts gives some insight into its behavior.
# Map each model label to its forecast file; load each file and average the
# 0/1 predictions to get the proportion of ones forecast per yield.
forecast_files = {
    "lr 0.03": "forecast logistic regression, 15y train test.csv",
    "lr 0.03 pca": "forecast logistic regression pca, 15y train test.csv",
    "lr 0.035": "forecast logistic regression, MI 0.035, 15y train test.csv",
    "lr 0.04": "forecast logistic regression, MI 0.04, 15y train test.csv",
    "lr 0.05": "forecast logistic regression, MI 0.05, 15y train test.csv",
    "rf 0.03": "forecast random forest, no pca, 15y train test.csv",
    "rf 0.03 pca": "forecast random forest pca, 15y train test.csv",
    "rf 0.035": "forecast random forest, MI 0.035, 15y train test.csv",
    "rf 0.04": "forecast random forest, MI 0.04, 15y train test.csv",
    "rf 0.05": "forecast random forest, MI 0.05, 15y train test.csv",
    "xg 0.03": "forecast xgboost, 15y train test.csv",
    "xg 0.03 pca": "forecast xgboost, pca, 15y train test.csv",
    "xg 0.035": "forecast xgboost, MI 0.035, 15y train test.csv",
    "xg 0.04": "forecast xgboost, MI 0.04, 15y train test.csv",
    "xg 0.05": "forecast xgboost, MI 0.05, 15y train test.csv",
}
forecast = pd.DataFrame(
    {
        name: pd.read_csv(f"results of models/{fname}", index_col=0).mean()
        for name, fname in forecast_files.items()
    }
)
forecast_lstm = pd.read_csv(
"results of models/forecast_LSTM_cls_12864128_0.02MI_2patience_52LB_evalsscores.csv",
index_col=0,
)
forecast_lstm = (forecast_lstm > 0.5).astype(int).mean()
forecast_lstm.name = "LSTM 0.02"
forecast_avg = pd.concat([forecast, forecast_lstm], axis=1).round(2)
forecast_avg.index = [
"1m yield",
"3m yield",
"6m yield",
"1y yield",
"2y yield",
"3y yield",
"5y yield",
"7y yield",
"10y yield",
"20y yield",
"30y yield",
]
forecast_avg = forecast_avg.add_prefix("Proportions of 1 forecast by ")
forecast_avg
| Proportions of 1 forecast by lr 0.03 | Proportions of 1 forecast by lr 0.03 pca | Proportions of 1 forecast by lr 0.035 | Proportions of 1 forecast by lr 0.04 | Proportions of 1 forecast by lr 0.05 | Proportions of 1 forecast by rf 0.03 | Proportions of 1 forecast by rf 0.03 pca | Proportions of 1 forecast by rf 0.035 | Proportions of 1 forecast by rf 0.04 | Proportions of 1 forecast by rf 0.05 | Proportions of 1 forecast by xg 0.03 | Proportions of 1 forecast by xg 0.03 pca | Proportions of 1 forecast by xg 0.035 | Proportions of 1 forecast by xg 0.04 | Proportions of 1 forecast by xg 0.05 | Proportions of 1 forecast by LSTM 0.02 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1m yield | 0.41 | 0.40 | 0.37 | 0.36 | 0.41 | 0.26 | 0.12 | 0.27 | 0.27 | 0.28 | 0.20 | 0.00 | 0.21 | 0.20 | 0.17 | 0.00 |
| 3m yield | 0.41 | 0.41 | 0.36 | 0.38 | 0.46 | 0.33 | 0.19 | 0.33 | 0.33 | 0.35 | 0.31 | 0.00 | 0.32 | 0.32 | 0.33 | 0.00 |
| 6m yield | 0.48 | 0.42 | 0.44 | 0.50 | 0.52 | 0.42 | 0.33 | 0.42 | 0.41 | 0.41 | 0.43 | 0.03 | 0.43 | 0.40 | 0.39 | 0.00 |
| 1y yield | 0.50 | 0.46 | 0.42 | 0.50 | 0.57 | 0.31 | 0.05 | 0.30 | 0.32 | 0.32 | 0.08 | 0.00 | 0.07 | 0.09 | 0.08 | 0.00 |
| 2y yield | 0.63 | 0.60 | 0.55 | 0.62 | 0.61 | 0.34 | 0.11 | 0.35 | 0.34 | 0.30 | 0.11 | 0.00 | 0.12 | 0.13 | 0.11 | 0.00 |
| 3y yield | 0.59 | 0.56 | 0.54 | 0.59 | 0.60 | 0.37 | 0.17 | 0.36 | 0.36 | 0.36 | 0.10 | 0.00 | 0.10 | 0.11 | 0.08 | 0.01 |
| 5y yield | 0.49 | 0.49 | 0.52 | 0.56 | 0.58 | 0.37 | 0.28 | 0.41 | 0.38 | 0.36 | 0.15 | 0.00 | 0.12 | 0.15 | 0.09 | 0.00 |
| 7y yield | 0.52 | 0.50 | 0.53 | 0.58 | 0.63 | 0.35 | 0.23 | 0.34 | 0.41 | 0.36 | 0.12 | 0.00 | 0.13 | 0.13 | 0.08 | 0.01 |
| 10y yield | 0.50 | 0.50 | 0.51 | 0.58 | 0.59 | 0.28 | 0.26 | 0.28 | 0.32 | 0.30 | 0.06 | 0.00 | 0.05 | 0.09 | 0.05 | 0.01 |
| 20y yield | 0.50 | 0.49 | 0.54 | 0.67 | 0.65 | 0.29 | 0.28 | 0.35 | 0.35 | 0.36 | 0.04 | 0.00 | 0.03 | 0.07 | 0.02 | 0.01 |
| 30y yield | 0.58 | 0.59 | 0.59 | 0.76 | 0.65 | 0.33 | 0.30 | 0.36 | 0.41 | 0.44 | 0.06 | 0.00 | 0.06 | 0.07 | 0.03 | 0.01 |
Let's compare the distribution of classes forecast by our models with the true distribution of classes in the data:
a  # per-yield proportion of the majority class, computed earlier
Y_DGS1MO    0.590476
Y_DGS3MO    0.555556
Y_DGS6MO    0.523810
Y_DGS1      0.555556
Y_DGS2      0.526984
Y_DGS3      0.507937
Y_DGS5      0.507937
Y_DGS7      0.504762
Y_DGS10     0.507937
Y_DGS20     0.523810
Y_DGS30     0.511111
dtype: float64
We can see that the classes predicted by the logistic regression and random forest models are relatively well balanced and not too different from the true class distribution. However, the XGBoost models predict very few ones: their proportion of predicted ones is often below 0.1. This is even more pronounced for the LSTM model.
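A plausible (unverified here) explanation is that these models output probabilities that rarely cross the default 0.5 cut-off. On synthetic probabilities, the sketch below shows how the proportion of predicted ones grows as the decision threshold is lowered, which is one lever for rebalancing such predictions:

```python
# Hedged sketch: proportion of predicted ones as a function of the decision
# threshold, on synthetic probabilities skewed towards low values (mimicking
# a conservative classifier; not our models' actual outputs).
import numpy as np

rng = np.random.default_rng(0)
proba = rng.beta(2, 5, size=1000)  # mean ~0.29, mostly below 0.5

for thr in (0.5, 0.3, proba.mean()):
    share = (proba > thr).mean()
    print(f"threshold {thr:.2f} -> proportion of ones: {share:.2f}")
```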
5.3. Ensemble Learning
We will now implement an ensemble learning model specialized in forecasting the short end of the yield curve. We do not build ensembles for the other parts of the curve, since the underlying models there do not have statistically significant performance.
top_columns_shortend = (
accuracies.head(4).mean(axis=0).sort_values(ascending=False).head(3).index.tolist()
)
print(
    "for the short end of the yield curve we will use these predictors",
    [
        col.removeprefix("average out-of-sample accuracy for ")
        for col in top_columns_shortend
    ],
)
for the short end of the yield curve we will use these predictors ['lr 0.035', 'lr 0.04', 'rf 0.05']
# Accuracy-weighted ensemble: for each yield, combine the top-3 models'
# 0/1 forecasts with weights proportional to their average accuracy,
# then threshold the weighted vote at 0.5
ensemble_short = pd.DataFrame(
    index=Yw_binary.iloc[780 : Yw_binary.shape[0] - 1].index, columns=Yw_binary.columns
)
accura = accuracies[top_columns_shortend]
for col, col2 in zip(
Yw_binary.columns,
[
"1m yield",
"3m yield",
"6m yield",
"1y yield",
"2y yield",
"3y yield",
"5y yield",
"7y yield",
"10y yield",
"20y yield",
"30y yield",
],
):
weights = accura.loc[col2] / accura.loc[col2].sum()
weights = weights.to_dict()
forecast = {
key: df.rename(columns={col: key})
for key, df in {
"lr 0.035": pd.read_csv(
"results of models/forecast logistic regression, MI 0.035, 15y train test.csv",
index_col=0,
)[[col]],
"lr 0.04": pd.read_csv(
"results of models/forecast logistic regression, MI 0.04, 15y train test.csv",
index_col=0,
)[[col]],
"rf 0.05": pd.read_csv(
"results of models/forecast random forest, MI 0.05, 15y train test.csv",
index_col=0,
)[[col]],
}.items()
}
forecast_agg = pd.concat(forecast.values(), axis=1)
weighted_preds = forecast_agg.copy()
for model in forecast_agg.columns:
weighted_preds[model] = (
forecast_agg[model] * weights[f"average out-of-sample accuracy for {model}"]
)
forecast_agg["ensemble"] = weighted_preds.sum(axis=1)
ensemble_short[col] = (forecast_agg["ensemble"] > 0.5).astype(int)
Yw_binary_true = Yw_binary.iloc[780 : Yw_binary.shape[0] - 1]
accu = (ensemble_short == Yw_binary_true).mean().round(2)
accu.index = accuracies.index
accuracies[
"average out-of-sample accuracy for short end of yield curve ensemble model"
] = accu
Aggregating the top-3 predictors improves the accuracy by about 1 percentage point on some specific points of the yield curve. However, this is a marginal improvement, and it was obtained without cross-validation and using the whole history of predictions, so it could not have been exploited in real time.
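A way to make such an ensemble usable in real time would be to re-estimate the accuracy weights at each date from past predictions only (walk-forward, expanding window). The sketch below uses synthetic 0/1 forecasts with assumed accuracies; `model_a`, `model_b` and `model_c` are illustrative names, not our saved forecast files.

```python
# Hedged sketch: walk-forward accuracy-weighted voting, where the weights at
# each date are computed only from forecasts already observed (no look-ahead).
# All predictions below are synthetic.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
T = 200
truth = pd.Series(rng.integers(0, 2, size=T))
# Each synthetic model is correct with probability p (its assumed accuracy)
preds = pd.DataFrame(
    {
        name: truth ^ (rng.random(T) > p).astype(int)
        for name, p in [("model_a", 0.65), ("model_b", 0.60), ("model_c", 0.55)]
    }
)

warmup = 20  # need some history before the first weighted forecast
ensemble = pd.Series(index=truth.index, dtype=float)
for t in range(warmup, T):
    past_acc = preds.iloc[:t].eq(truth.iloc[:t], axis=0).mean()
    weights = past_acc / past_acc.sum()  # weights from past accuracy only
    ensemble.iloc[t] = float((preds.iloc[t] * weights).sum() > 0.5)

oos = ensemble.dropna()
acc = (oos == truth.loc[oos.index]).mean()
print(f"walk-forward ensemble accuracy: {acc:.2f}")
```

Because the weights at date t depend only on data up to t-1, this version of the ensemble could actually have been traded in real time, unlike the full-history weighting above.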
prefix = "average out-of-sample accuracy for "
best_models = pd.DataFrame(
    {
        "Best model": [col.removeprefix(prefix) for col in accuracies.idxmax(axis=1)],
        "Best average out-of-sample accuracy": accuracies.max(axis=1),
    }
)
best_models
| Best model | Best average out-of-sample accuracy | |
|---|---|---|
| 1m yield | rf 0.05 | 0.65 |
| 3m yield | short end of yield curve ensemble model | 0.72 |
| 6m yield | lr 0.035 | 0.68 |
| 1y yield | lr 0.035 | 0.62 |
| 2y yield | lr 0.03 pca | 0.59 |
| 3y yield | rf 0.05 | 0.56 |
| 5y yield | short end of yield curve ensemble model | 0.57 |
| 7y yield | lr 0.03 | 0.55 |
| 10y yield | lr 0.03 | 0.54 |
| 20y yield | rf 0.03 pca | 0.53 |
| 30y yield | lr 0.03 | 0.53 |
Running the McNemar test, we see that this ensemble model has a statistically significant performance on short-term yields, with a p-value below 0.1 for the 1-month, 3-month and 6-month yields.
pval_ensemble = pd.DataFrame(
columns=majority_class_classifier.columns, index=["pvalue"]
)
for col in majority_class_classifier.columns:
y_pred = ensemble_short[col]
y_benchmark = majority_class_classifier[col]
y_true = Yw_binary.iloc[780 : Yw_binary.shape[0] - 1][col]
# McNemar test
correct_real = y_pred == y_true
correct_benchmark = y_benchmark == y_true
table = confusion_matrix(correct_real, correct_benchmark)
result = mcnemar(table, exact=False, correction=True)
pval_ensemble.loc["pvalue", col] = result.pvalue
pval_ensemble
| Y_DGS1MO | Y_DGS3MO | Y_DGS6MO | Y_DGS1 | Y_DGS2 | Y_DGS3 | Y_DGS5 | Y_DGS7 | Y_DGS10 | Y_DGS20 | Y_DGS30 | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| pvalue | 0.076131 | 0.000003 | 0.000043 | 0.115796 | 0.476767 | 0.601845 | 0.101325 | 0.338212 | 0.747992 | 0.768692 | 0.476173 |
5.4. Evolution of the performance of the models
We will plot the evolution through time of the 24-week rolling average out-of-sample accuracy of the models. This will show whether the models perform consistently through time or are only reliable in specific periods. We focus on the models forecasting the short-term yields, since this is where they are most successful, and we keep only those that are statistically significant. They are listed below; a NaN in the dataframe indicates that the model is not statistically significant:
acc_sig.head(4)
| lr 0.03 | lr 0.03 pca | lr 0.035 | lr 0.04 | lr 0.05 | rf 0.03 | rf 0.03 pca | rf 0.035 | rf 0.04 | rf 0.05 | xg 0.03 | xg 0.035 | xg 0.04 | xg 0.05 | LSTM 0.02 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1m yield | NaN | NaN | NaN | NaN | NaN | 0.64 | NaN | NaN | NaN | 0.64 | 0.59 | NaN | NaN | NaN | 0.63 |
| 3m yield | 0.68 | 0.67 | 0.71 | 0.71 | 0.65 | 0.70 | 0.7 | 0.70 | 0.69 | 0.68 | 0.55 | 0.69 | 0.69 | 0.68 | 0.62 |
| 6m yield | 0.64 | 0.63 | 0.68 | 0.65 | 0.64 | 0.67 | NaN | 0.67 | 0.67 | 0.66 | 0.50 | 0.68 | 0.67 | 0.66 | NaN |
| 1y yield | NaN | NaN | 0.62 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.52 | NaN | 0.57 |
col = "Y_DGS1MO"
forecast = {
"rf 0.03": pd.read_csv(
"results of models/forecast random forest, no pca, 15y train test.csv",
index_col=0,
)[col],
"rf 0.04": pd.read_csv(
"results of models/forecast random forest, MI 0.04, 15y train test.csv",
index_col=0,
)[col],
"rf 0.05": pd.read_csv(
"results of models/forecast random forest, MI 0.05, 15y train test.csv",
index_col=0,
)[col],
"xg 0.03": pd.read_csv(
"results of models/forecast xgboost, 15y train test.csv", index_col=0
)[col],
}
forecast = pd.DataFrame(forecast)
forecast_lstm = pd.read_csv(
"results of models/forecast_LSTM_cls_12864128_0.02MI_2patience_52LB_evalsscores.csv",
index_col=0,
)[col]
forecast_lstm = (forecast_lstm > 0.5).astype(int)
forecast_lstm.name = "LSTM 0.02"
ensemble = ensemble_short[col]
ensemble.name = "Ensemble"
bench = majority_class_classifier[col]
bench.name = "Majority class classifier benchmark"
forecast = pd.concat([forecast, forecast_lstm, ensemble, bench], axis=1).round(2)
forecast.index = Yw_binary.iloc[780 : Yw_binary.shape[0]].index
true_values = Yw_binary.iloc[780 : Yw_binary.shape[0]][col]
fig, ax = plt.subplots(figsize=(12, 6))
for model in forecast.columns:
correct = forecast[model] == true_values
rolling_acc = correct.rolling(window=24).mean()
if model == "Majority class classifier benchmark":
rolling_acc.plot(ax=ax, label=model, color="black")
else:
rolling_acc.plot(ax=ax, label=model)
ax.grid(axis="y", color="lightgrey", linestyle="--", linewidth=1)
plt.title(
"Rolling average out-of-sample accuracy (over 24 weeks) of statistically significant models predicting the 1 month yield"
)
plt.xlabel("Date")
plt.ylabel("Accuracy")
plt.legend(loc="center left", bbox_to_anchor=(1, 0.5))
plt.tight_layout()
plt.show()
We can see from the plot above that the majority class classifier benchmark (black line) is difficult to beat consistently through time:
- Over 2018-2021, the benchmark almost systematically delivers a rolling accuracy as good as that of the models forecasting the 1-month yield, suggesting that the models do not learn much from our features over this period.
- After 2021, the models start beating the benchmark. This is probably an effect of the rate hikes that followed the COVID pandemic: at that point, the models have likely learned something from the macro and monetary variables in the dataset.
- Our models' rolling average out-of-sample accuracy is above 0.5 most of the time, but it is quite volatile, showing that there are periods in which the models perform very poorly.
col = "Y_DGS3MO"
forecast = {
"lr 0.03": pd.read_csv(
"results of models/forecast logistic regression, 15y train test.csv",
index_col=0,
)[col],
"lr 0.03 pca": pd.read_csv(
"results of models/forecast logistic regression pca, 15y train test.csv",
index_col=0,
)[col],
"lr 0.035": pd.read_csv(
"results of models/forecast logistic regression, MI 0.035, 15y train test.csv",
index_col=0,
)[col],
"lr 0.04": pd.read_csv(
"results of models/forecast logistic regression, MI 0.04, 15y train test.csv",
index_col=0,
)[col],
"lr 0.05": pd.read_csv(
"results of models/forecast logistic regression, MI 0.05, 15y train test.csv",
index_col=0,
)[col],
"rf 0.03": pd.read_csv(
"results of models/forecast random forest, no pca, 15y train test.csv",
index_col=0,
)[col],
"rf 0.03 pca": pd.read_csv(
"results of models/forecast random forest pca, 15y train test.csv", index_col=0
)[col],
"rf 0.035": pd.read_csv(
"results of models/forecast random forest, MI 0.035, 15y train test.csv",
index_col=0,
)[col],
"rf 0.04": pd.read_csv(
"results of models/forecast random forest, MI 0.04, 15y train test.csv",
index_col=0,
)[col],
"rf 0.05": pd.read_csv(
"results of models/forecast random forest, MI 0.05, 15y train test.csv",
index_col=0,
)[col],
"xg 0.03": pd.read_csv(
"results of models/forecast xgboost, 15y train test.csv", index_col=0
)[col],
"xg 0.03 pca": pd.read_csv(
"results of models/forecast xgboost, pca, 15y train test.csv", index_col=0
)[col],
"xg 0.035": pd.read_csv(
"results of models/forecast xgboost, MI 0.035, 15y train test.csv", index_col=0
)[col],
"xg 0.04": pd.read_csv(
"results of models/forecast xgboost, MI 0.04, 15y train test.csv", index_col=0
)[col],
"xg 0.05": pd.read_csv(
"results of models/forecast xgboost, MI 0.05, 15y train test.csv", index_col=0
)[col],
}
forecast = pd.DataFrame(forecast)
forecast_lstm = pd.read_csv(
"results of models/forecast_LSTM_cls_12864128_0.02MI_2patience_52LB_evalsscores.csv",
index_col=0,
)[col]
forecast_lstm = (forecast_lstm > 0.5).astype(int)
forecast_lstm.name = "LSTM 0.02"
ensemble = ensemble_short[col]
ensemble.name = "Ensemble"
bench = majority_class_classifier[col]
bench.name = "Majority class classifier benchmark"
forecast = pd.concat([forecast, forecast_lstm, ensemble, bench], axis=1).round(2)
forecast.index = Yw_binary.iloc[780 : Yw_binary.shape[0]].index
true_values = Yw_binary.iloc[780 : Yw_binary.shape[0]][col]
fig, ax = plt.subplots(figsize=(14, 6))
style_cycle = cycle(["-", "--", ":"])
for model in forecast.columns:
correct = forecast[model] == true_values
rolling_acc = correct.rolling(window=24).mean()
if model == "Majority class classifier benchmark":
rolling_acc.plot(ax=ax, label=model, color="black")
else:
rolling_acc.plot(ax=ax, label=model, linestyle=next(style_cycle))
ax.grid(axis="y", color="lightgrey", linestyle="--", linewidth=1)
plt.title(
"Rolling average out-of-sample accuracy (over 24 weeks) of statistically significant models predicting the 3 month yield"
)
plt.xlabel("Date")
plt.ylabel("Accuracy")
plt.legend(loc="center left", bbox_to_anchor=(1, 0.5))
plt.tight_layout()
plt.show()
The results are more satisfactory for the 3-month yields, but the overall pattern remains similar to that of the 1-month models: a significant number of models outperform the benchmark starting from late 2021, while before that the benchmark performs roughly on par with the models. We also observe some volatility in the models' accuracy. The logistic regression models appear slightly more stable than the others, as does the ensemble model.
col = "Y_DGS6MO"
forecast = {
"lr 0.03": pd.read_csv(
"results of models/forecast logistic regression, 15y train test.csv",
index_col=0,
)[col],
"lr 0.03 pca": pd.read_csv(
"results of models/forecast logistic regression pca, 15y train test.csv",
index_col=0,
)[col],
"lr 0.035": pd.read_csv(
"results of models/forecast logistic regression, MI 0.035, 15y train test.csv",
index_col=0,
)[col],
"lr 0.04": pd.read_csv(
"results of models/forecast logistic regression, MI 0.04, 15y train test.csv",
index_col=0,
)[col],
"lr 0.05": pd.read_csv(
"results of models/forecast logistic regression, MI 0.05, 15y train test.csv",
index_col=0,
)[col],
"rf 0.03": pd.read_csv(
"results of models/forecast random forest, no pca, 15y train test.csv",
index_col=0,
)[col],
"rf 0.03 pca": pd.read_csv(
"results of models/forecast random forest pca, 15y train test.csv", index_col=0
)[col],
"rf 0.035": pd.read_csv(
"results of models/forecast random forest, MI 0.035, 15y train test.csv",
index_col=0,
)[col],
"rf 0.04": pd.read_csv(
"results of models/forecast random forest, MI 0.04, 15y train test.csv",
index_col=0,
)[col],
"rf 0.05": pd.read_csv(
"results of models/forecast random forest, MI 0.05, 15y train test.csv",
index_col=0,
)[col],
"xg 0.03": pd.read_csv(
"results of models/forecast xgboost, 15y train test.csv", index_col=0
)[col],
"xg 0.03 pca": pd.read_csv(
"results of models/forecast xgboost, pca, 15y train test.csv", index_col=0
)[col],
"xg 0.035": pd.read_csv(
"results of models/forecast xgboost, MI 0.035, 15y train test.csv", index_col=0
)[col],
"xg 0.04": pd.read_csv(
"results of models/forecast xgboost, MI 0.04, 15y train test.csv", index_col=0
)[col],
"xg 0.05": pd.read_csv(
"results of models/forecast xgboost, MI 0.05, 15y train test.csv", index_col=0
)[col],
}
forecast = pd.DataFrame(forecast)
ensemble = ensemble_short[col]
ensemble.name = "Ensemble"
bench = majority_class_classifier[col]
bench.name = "Majority class classifier benchmark"
forecast = pd.concat([forecast, ensemble, bench], axis=1).round(2)
forecast.index = Yw_binary.iloc[780 : Yw_binary.shape[0]].index
true_values = Yw_binary.iloc[780 : Yw_binary.shape[0]][col]
fig, ax = plt.subplots(figsize=(14, 6))
style_cycle = cycle(["-", "--", ":"])
for model in forecast.columns:
correct = forecast[model] == true_values
rolling_acc = correct.rolling(window=24).mean()
if model == "Majority class classifier benchmark":
rolling_acc.plot(ax=ax, label=model, color="black")
else:
rolling_acc.plot(ax=ax, label=model, linestyle=next(style_cycle))
ax.grid(axis="y", color="lightgrey", linestyle="--", linewidth=1)
plt.title(
"Rolling average out-of-sample accuracy (over 24 weeks) of statistically significant models predicting the 6 month yield"
)
plt.xlabel("Date")
plt.ylabel("Accuracy")
plt.legend(loc="center left", bbox_to_anchor=(1, 0.5))
plt.tight_layout()
plt.show()
col = "Y_DGS1"
forecast = {
"lr 0.035": pd.read_csv(
"results of models/forecast logistic regression, MI 0.035, 15y train test.csv",
index_col=0,
)[col],
"xg 0.035": pd.read_csv(
"results of models/forecast xgboost, MI 0.035, 15y train test.csv", index_col=0
)[col],
"xg 0.04": pd.read_csv(
"results of models/forecast xgboost, MI 0.04, 15y train test.csv", index_col=0
)[col],
}
forecast = pd.DataFrame(forecast)
forecast_lstm = pd.read_csv(
"results of models/forecast_LSTM_cls_12864128_0.02MI_2patience_52LB_evalsscores.csv",
index_col=0,
)[col]
forecast_lstm = (forecast_lstm > 0.5).astype(int)
forecast_lstm.name = "LSTM 0.02"
bench = majority_class_classifier[col]
bench.name = "Majority class classifier benchmark"
forecast = pd.concat([forecast, forecast_lstm, bench], axis=1).round(2)
forecast.index = Yw_binary.iloc[780:].index
true_values = Yw_binary.iloc[780:][col]
fig, ax = plt.subplots(figsize=(14, 6))
style_cycle = cycle(["-", "--", ":"])
for model in forecast.columns:
correct = forecast[model] == true_values
rolling_acc = correct.rolling(window=24).mean()
if model == "Majority class classifier benchmark":
rolling_acc.plot(ax=ax, label=model, color="black")
else:
rolling_acc.plot(ax=ax, label=model, linestyle=next(style_cycle))
ax.grid(axis="y", color="lightgrey", linestyle="--", linewidth=1)
plt.title(
"Rolling average out-of-sample accuracy (over 24 weeks) of statistically significant models predicting the 1 year yield"
)
plt.xlabel("Date")
plt.ylabel("Accuracy")
plt.legend(loc="center left", bbox_to_anchor=(1, 0.5))
plt.tight_layout()
plt.show()
Once again, the benchmark is difficult to beat consistently.
6. Conclusion
This project aimed to predict the yield curve of US government bonds using macroeconomic data from the FED, market variables, and functions of past yields. Our objective was to compare linear models with complex non-linear models to determine whether the relationships between financial variables are linear, or whether significant nonlinearities allow complex models to outperform simple linear ones.
Initially, we attempted to regress the daily yield differences. The penalized linear models (Ridge, Lasso, and Elastic Net) failed to identify a relationship between the yields and the features used. Performing a PCA on the dataset to reduce the number of features and their correlation did not improve the predictions: the out-of-sample $\text{R}^2$ is systematically close to zero or negative, and the in-sample $\text{R}^2$ oscillates between 0 and 0.4, meaning that the features do not effectively explain the variance of the yields, even in-sample. Given this finding, we switched to binary classification models on weekly data: our second objective was to predict whether each yield would increase ($Y=1$) or decrease ($Y=0$) each week.
For this binary classification task, we trained our models on several feature datasets: the first contains all features whose mutual information with the target $Y$ exceeds 0.03, the second 0.035, the third 0.04, and the last 0.05. As this mutual-information threshold increases, the number of features in the dataset decreases: in particular, for the datasets with thresholds of 0.04 and 0.05, almost no macroeconomic or market features remain, and nearly all remaining features are functions of past yields. The idea of training models on different feature datasets was thus to see whether the models could successfully select features in a high-dimensional dataset, or whether it was better to keep only the features with the highest mutual information, even at the cost of sacrificing macroeconomic and financial variables. The models trained were the following: logistic regression, Random Forest, XGBoost, and LSTM. We trained these models on a total of 4 feature datasets ($\text{X}$) and on 11 different target variables ($\text{Y}$), namely the yield variations for 1 month, 3 months, 6 months, $\dots$, up to 30 years. This resulted in a total of 171 models, each successively re-trained on 15-year rolling periods to predict the yield variations for the subsequent four weeks. Our main findings regarding these classifier models are the following:
- When performing McNemar's tests to determine if our models have statistically significant predictive power, it appears that it is primarily the models predicting short-term yields (less than one year) that statistically outperform the "majority class classifier" benchmark.
- Logistic regression models perform well on short-term yields and often provide the best average out-of-sample accuracy, which reaches up to 71% for 3-month yields. In some cases, Random Forest models provide a marginal improvement in accuracy over logistic regression and seem to capture some non-linearity in the data, but this comes at the cost of interpretability: logistic regression is therefore a satisfactory choice, as it effectively exploits the predictive power of the features while remaining a simple model.
- XGBoost and LSTM, the most complex models, tend to predict almost exclusively zeros. These models massively overfit the data and are largely ineffective.
- For long-term yields, the results from all models are disappointing, and they rarely perform statistically better than the "majority class classifier" benchmark: this suggests that the models struggle to learn anything from the features. This is not surprising as long-term yields depend less on macroeconomic variables than short-term yields.
- The datasets that yield the best results are generally those containing a mix of macroeconomic data, financial data, and functions of past yields (mutual information $> 0.035$), which confirms that the FED's macro data is useful for predicting yields. However, for 1-month yields, the best dataset is the one containing only functions of past yields (mutual information $> 0.05$): an out-of-sample accuracy of $\mathbf{0.65}$ can be achieved solely with functions of past yields, which illustrates the strong trend-following and mean-reverting nature of these yields.
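The mutual-information filtering described above can be sketched as follows with scikit-learn's `mutual_info_classif`. The feature names, the synthetic data, and the 0.035 threshold below are illustrative stand-ins for the notebook's actual feature matrix and weekly binary targets:

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import mutual_info_classif

# Synthetic stand-in for the notebook's feature matrix and binary target:
# one informative feature, the rest pure noise.
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(500, 6)),
                 columns=[f"feat_{i}" for i in range(6)])
y = (X["feat_0"] + 0.5 * rng.normal(size=500) > 0).astype(int)

# Estimate mutual information between each feature and the binary target.
mi = pd.Series(mutual_info_classif(X, y, random_state=0), index=X.columns)

# Keep only features whose MI exceeds the chosen threshold.
threshold = 0.035
selected = mi[mi > threshold].index.tolist()
X_filtered = X[selected]
print(selected)
```

In the notebook this filter was applied once per threshold (0.03, 0.035, 0.04, 0.05) to build the four feature datasets.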
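The McNemar's tests mentioned in the first finding compare paired predictions on the same test set by counting the cases where one classifier is right and the other wrong. A minimal sketch using `statsmodels.stats.contingency_tables.mcnemar`; all predictions here are simulated, not the notebook's actual forecasts:

```python
import numpy as np
from statsmodels.stats.contingency_tables import mcnemar

# Simulated out-of-sample labels and predictions (illustrative only):
# a model with ~70% accuracy vs. an always-predict-1 benchmark.
rng = np.random.default_rng(1)
y_true = rng.integers(0, 2, size=300)
model_pred = np.where(rng.random(300) < 0.7, y_true, 1 - y_true)
bench_pred = np.ones(300, dtype=int)

model_correct = model_pred == y_true
bench_correct = bench_pred == y_true

# 2x2 contingency table of (model correct?) x (benchmark correct?);
# the test focuses on the discordant off-diagonal cells.
table = [
    [np.sum(model_correct & bench_correct), np.sum(model_correct & ~bench_correct)],
    [np.sum(~model_correct & bench_correct), np.sum(~model_correct & ~bench_correct)],
]
result = mcnemar(table, exact=False, correction=True)
print(f"statistic={result.statistic:.2f}, p-value={result.pvalue:.4f}")
```

A small p-value indicates the model's error pattern differs significantly from the benchmark's, which is how the statistically significant models plotted above were identified.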
Future extensions of this project could include:
- Further work on ensemble learning models, which seem quite promising for aggregating the results of our trained models.
- An analysis of the evolution of feature importance over time, to understand which features are identified as the best by the models in each rolling training period. This could allow for more advanced feature engineering and focus on the most promising features.
- The implementation of a trading strategy on US bonds with maturities of less than one year, based on these signals; this would require bond price data.
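As a starting point for the ensemble-learning extension, a simple majority vote over the per-model binary forecasts could look like the sketch below; the column names and forecast values are illustrative, not taken from the trained models:

```python
import pandas as pd

# Hypothetical binary forecasts from three trained models for five weeks.
forecasts = pd.DataFrame({
    "lr 0.035": [1, 0, 1, 1, 0],
    "xg 0.035": [1, 1, 0, 1, 0],
    "rf 0.035": [0, 0, 1, 1, 1],
})

# Majority vote: predict 1 whenever more than half of the models predict 1.
ensemble = (forecasts.mean(axis=1) > 0.5).astype(int)
print(ensemble.tolist())  # → [1, 0, 1, 1, 0]
```

More refined aggregations (e.g. weighting each model by its rolling out-of-sample accuracy) would follow the same pattern.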